Preface



As a young professor in 1997 I taught my graduate course in Stochastic Image Pro- 
cessing for the first time. Looking back on my rough notes from that time, the course 
must have been a near impenetrable disaster for the graduate students enrolled, with 
a long list of errors, confusions, and bad notation. 

With every repetition the course improved, with significant changes to notation,
content, and flow. However, at the same time that a cohesive, large-scale form of the
course took shape, the absence of any textbook covering this material became
increasingly apparent. There are countless texts on image processing,
Kalman filtering, and signal processing, but precious little on random fields or
spatial statistics. The few texts that do cover Gibbs models or Markov random fields
tend to be highly mathematical research monographs, not well suited as textbooks
for a graduate course.

More than just a graduate course textbook, this text was developed with the goal of 
being a useful reference for graduate students working in the areas of image pro- 
cessing, spatial statistics, and random fields. In particular, there are many concepts 
which are known and documented in the research literature, which are useful for stu- 
dents to understand, but which do not appear in many textbooks. This perception is 
driven by my own experience as a PhD student, which would have been considerably 
simplified if I had had a text accessible to me addressing some of the following gaps: 

• FFT-based estimation (Section 8.3) 

• A nice, simple, clear description of multigrid (Section 9.2.5) 

• The inference of dynamic models from cross-statistics (Chapter 10) 

• A clear distinction and relationship between squared and unsquared kernels 
(Chapter 5) 




• A graphical summary relating Gibbs and Markov models (Figure 6.11) 

To facilitate the use of this textbook and the methods described within it, I am making 
available online (see page XV) much of the code which I developed for this text. This 
code, some colour figures, and (hopefully few) errata can be found from this book's 
home page: 

http://ocho.uwaterloo.ca/book

This text has benefited from the work, support, and ideas of a great many people. 
I owe a debt of gratitude to the countless researchers upon whose work this book 
is built, and who are listed in the bibliography. Please accept my apologies for any 
omissions. 

The contents of this book are closely aligned with my research interests over the 
past ten years. Consequently the work of a number of my former graduate students 
appears in some form in this book, and I would like to recognize the contributions of 
Simon Alexander, Wesley Campaigne, Gabriel Carballo, Michale Jamieson, Fu Jin, 
Fakhry Khellah, Ying Liu, and Azadeh Mohebi. 

I would like to thank my Springer editor, John Kimmel, who was an enthusiastic
supporter of this text, and highly tolerant of my slow pace in writing. Thanks also
to copy editor Valerie Greco for her careful examination of grammar and punctuation
(any remaining errors are mine, not hers). I would like to thank the
anonymous reviewers, who read the text thoroughly and who provided exceptionally
helpful constructive criticism. I would also like to thank the non-anonymous reviewers,
friends and students who gave the text another look: Werner Fieguth, Betty Pries,
Akshaya Mishra, Alexander Wong, Li Liu, and Gerald Mwangi.

I would like to thank Christoph Garbe and Michael Winckler, the two people who 
coordinated my stay at the University of Heidelberg, where this text was completed. 
My thanks to the Deutscher Akademischer Austausch Dienst, the Heidelberg Gradu- 
ate School, and to the Heidelberg Collaboratory for Image Processing for supporting 
my visit. 

Many thanks and appreciation to Betty for encouraging this project, and to the kids 
in Appendix C for just being who they are. 



Waterloo, Ontario Paul Fieguth 

July, 2010 
The following tables of nomenclature are designed to assist the reader in understanding
the mathematical language used throughout this text. In the author's opinion this
is of considerable value particularly for readers who seek to use the book as a ref- 
erence and need to be able to understand individual equations or sections without 
reading an entire chapter for context. Four sets of definitions follow: 

1. Basic syntax 

2. Mathematical functions 

3. Definitions of commonly-used variables 

4. Notation for spatial models 


We limit ourselves here to just defining the notation. For an explanation of algebraic 
concepts (matrix transpose, eigendecomposition, inverse, etc.) the reader is referred 
to Appendix A. For an explanation of related statistical concepts (expectation, co- 
variance, etc.), see Appendix B. A brief overview of image processing can be found 
in Appendix C. Most of the spatial models are explained in Chapters 5 and 6. 






Syntax                                   Definition

$a$                                      scalar, random variable
$\underline{a}$                          column vector, random vector
$a_i$                                    $i$th element of vector $\underline{a}$
$\underline{a}_i$                        $i$th vector in a sequence
$a_{ij}$                                 $i,j$th element of matrix $A$
$A$                                      matrix
$A^T$                                    matrix transpose
$A^H$                                    matrix Hermitian (complex transpose)
$A^{-1}$                                 matrix inverse
$|A|$                                    matrix determinant
$\mathcal{A}$                            kernel corresponding to stationary matrix $A$
$\mathcal{A}^{-1}$                       kernel corresponding to $A^{-1}$, but $\mathcal{A}^{-1} \neq (\mathcal{A})^{-1}$
$\mathcal{A}^{T}$                        kernel corresponding to $A^T$, but $\mathcal{A}^{T} \neq (\mathcal{A})^{T}$

$\mathbb{R}^n$                           real vector of length $n$
$\mathbb{R}^{k \times n}$                real $k \times n$ array
$[A]_:$                                  reordering of matrix to column vector
$[\underline{a}]_{n \times m}$           reordering of column vector to $n \times m$ matrix
$[\underline{a}]_{n_1 \times n_2 \times \cdots}$   reordering to $n_1 \times n_2 \times \cdots$ multidimensional array

$\hat{a}, \hat{\underline{a}}, \hat{A}$  estimate of $a$, $\underline{a}$, $A$
$\tilde{a}$                              estimation error in $a$
$\bar{a}, \bar{\underline{a}}, \bar{A}$  transformation of $a$, $\underline{a}$, $A$
$\check{a}$                              given sample data of $a$
$\tilde{P}$                              estimation error covariance

$\Pr(Q)$                                 probability of some event $Q$
$p(x)$                                   probability density of $x$
$p(x|y)$                                 conditional probability density

$\{\ldots\}$                             a set
$|S|$                                    number of elements in set $S$
$\|x\|$, $\|A\|$                         vector norm for $x$, matrix norm for $A$
$\|x\|_P$                                vector squared-norm for $x$ with respect to covariance $P$
(circled element)                        convolution kernel origin

$\sim$                                   is distributed as ...
$x \sim P$                               $x$ has covariance $P$; the mean is zero or not of interest
$x \sim (\mu, P)$                        $x$ has mean $\mu$ and covariance $P$, distribution unknown
$x \sim \mathcal{N}(\mu, P)$             $x$ is Gaussian with mean $\mu$ and covariance $P$

Tab. Notation.1. Basic vector, matrix, and statistical syntax
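
The two reordering operations listed in the table (matrix to column vector, and column vector back to a matrix) can be sketched in a few lines. This is only an illustration: the table does not say in which order the elements are stacked, so the column-by-column ordering used here is an assumption, and the function names `stack_columns` and `unstack_columns` are hypothetical.

```python
def stack_columns(A):
    """Reorder an n x m matrix (a list of rows) into a vector of length n*m.

    Assumption: elements are stacked column by column.
    """
    n, m = len(A), len(A[0])
    return [A[i][j] for j in range(m) for i in range(n)]


def unstack_columns(a, n, m):
    """Reorder a vector of length n*m back into an n x m matrix (a list of rows)."""
    if len(a) != n * m:
        raise ValueError("vector length must equal n * m")
    return [[a[j * n + i] for j in range(m)] for i in range(n)]


A = [[1, 2, 3],
     [4, 5, 6]]
a = stack_columns(A)                      # columns (1,4), (2,5), (3,6) laid end to end
assert unstack_columns(a, 2, 3) == A      # the two reorderings undo each other
```

Whatever ordering is chosen, the essential property is that the two operations are exact inverses of one another, so that a multidimensional field can be treated as an ordinary vector and restored afterwards.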



Function                     Definition

$\mathrm{sign}(a)$           the sign of scalar $a$: $-1$ for $a < 0$, $0$ for $a = 0$, $1$ for $a > 0$
$a \bmod b$                  division modulus (remainder)
$\min(\cdot)$                the minimum value in a set
$\min_{x \in X}(\cdot)$      the minimum value of a function over range $X$
$\arg_x \min(\cdot)$         the value of $x$ which minimizes the function
$\mathrm{Ra}(A)$             the range space of $A$
$\mathrm{Nu}(A)$             the null space of $A$
$\mathrm{rank}(A)$           the rank of $A$
$\dim(\cdot)$                the dimension of a space
$\mathrm{tr}(A)$             the trace, the sum of the diagonal elements of $A$
$\det(A)$                    the determinant of $A$
$\kappa(A)$                  matrix condition number of $A$
$\mathrm{diag}(A)$           a vector containing the diagonal elements of $A$
$\mathrm{Diag}(x)$           a diagonal matrix with $x$ along the diagonal
$I$                          the identity matrix

$\mathrm{FFT}_d$             the $d$-dimensional fast Fourier transform
$\mathrm{FFT}_d^{-1}$        the $d$-dimensional inverse fast Fourier transform
$\mathrm{WT}$                the wavelet transform
$\mathrm{WT}^{-1}$           the inverse wavelet transform

$\odot$                      element-by-element matrix multiplication
$\oslash$                    element-by-element matrix division
$*$                          convolution
$\circledast$                circular convolution

$\mathrm{var}(\cdot)$        variance
$\mathrm{cov}(\cdot)$        covariance
$E[\cdot]$                   expectation
$E_a[\cdot]$                 expectation over variable $a$, if otherwise ambiguous

$\equiv$                     is equivalent to, identical to
$\triangleq$                 is defined as
$<, \leq, \geq, >$           inequalities, in positive-definite sense for matrices

Tab. Notation.2. Mathematical functions and operations (see Appendix A)



Symbol        Definition

$b$           linear system target
$b$           estimator bias
$c$           random field clique
$d$           spatial dimensionality
$e$           error
$e_i$         the $i$th unit vector: all zeros with a one in the $i$th position
$f$           forward problem
$g$           Markov random field model coefficient
$i, j$        general indices
$k, n, q$     matrix and vector dimensions
$m$           measurement
$p$           probability density
$r$           linear system residual
$s, t$        time
$v$           measurement noise
$v$           eigenvector
$w$           dynamic process noise
$x, y$        spatial location or indices
$z$           system state

$A$           linear system, normal equations
$A$           dynamic model predictor
$B$           dynamic model stochastic weight
$C$           measurement model
$E$           expectation
$F$           Fourier transform
$F$           change of basis (forwards)
$G$           Markov random field model
$H$           energy function
$I$           identity matrix
$J$           optimization criterion
$K$           estimator gain
$L$           constraints matrix
$M$           measurements field (multidimensional)
$N$           image or patch size
$P$           state covariance
$Q$           squared system constraints
$R$           measurement noise covariance
$S$           change of basis (backwards)
$T$           annealing temperature
$U, V$        orthogonal matrices
$W$           wavelet transform
$Z$           random field (multidimensional)

Tab. Notation.3. Symbol definitions



Symbol                       Definition

$\alpha, \beta, \gamma$      constants
$\beta$                      Gibbs inverse temperature
$\delta$                     Dirac delta
$\delta$                     small offset or perturbation
$\epsilon$                   small amount
$\kappa$                     matrix condition number
$\theta$                     model parameters
$\lambda$                    regularization parameter
$\lambda$                    eigenvalue
$\mu$                        mean
$\nu$                        estimator innovations
$\rho$                       correlation coefficient
$\rho$                       spectral radius
$\sigma$                     standard deviation
$\sigma$                     singular value
$\tau$                       time offset or period
$\xi$                        correlation length
$\zeta$                      threshold

$\Gamma$                     covariance square root
$\Lambda, \Sigma$            covariance
$\Psi$                       state space or alphabet
$\Omega$                     problem space (multidimensional lattice)
$\Xi$                        region subset operator

$\mathcal{B}$                matrix banding structure
$\mathcal{C}$                clique set
$\mathcal{N}$                neighbourhood
$\mathcal{N}(\mu, P)$        Gaussian distribution with mean $\mu$, covariance $P$
$\mathcal{O}(\cdot)$         complexity order
$\mathbb{R}$                 real
$\mathbb{R}^n$               real vector of length $n$
$\mathbb{R}^{k \times n}$    real $k \times n$ array
$\mathcal{Z}$                Gibbs partition function

$0, 1$                       scalar constant zero, one
$\underline{0}, \underline{1}$   vector constants of all zeros, all ones
$\mathbf{0}, \mathbf{1}$     matrix constants of all zeros, all ones

Tab. Notation.3. Symbol definitions (cont'd)



Nonstationary    Stationary
(Dense)          (Kernel)

$A, B$           $\mathcal{A}, \mathcal{B}$      Square root model (dynamic)
Introduction



Images are all around us! Inexpensive digital cameras, video cameras, computer webcams,
satellite imagery, and images off the Internet give us access to spatial imagery
of all sorts. The vast majority of these images will be of scenes at human scales
(pictures of animals / houses / people / faces and so on), relatively complex images
which are not well described statistically or mathematically. Many algorithms have
been developed to process / denoise / compress / segment such images, described
in innumerable textbooks on image processing [36, 54, 143, 174, 210], and briefly
reviewed in Appendix C.

Somewhat less common, but of great research interest, are images which do allow 
some sort of mathematical characterization, and to which standard image-processing 
algorithms may not apply. In most cases we do not necessarily have images here, per 
se, but rather spatial datasets, with one or more measurements taken over a two- or 
higher-dimensional space. 

There are many important problems falling into this latter group of scientific images,
and this is where this text seeks to make a contribution. Examples abound throughout
remote sensing (satellite data mapping, data assimilation, sea-ice / climate-change
studies, land use), medical imaging (denoising, organ segmentation, anomaly detection),
computer vision (textures, image classification, segmentation), and other 2D /
3D problems (groundwater, biological imaging, porous media, etc.).

Although a great deal of research has been applied to scientific images, in most 
cases the resulting methods are not well documented in common textbooks, such 
that many experienced researchers will be unfamiliar with the use of the FFT method 
(Section 8.3) or of posterior sampling (Chapter 11), for example. 

The goal, then, of this text is to address methods for solving multidimensional inverse
problems. In particular, the text seeks to avoid the pitfall of being entirely
mathematical / theoretical at one extreme, or primarily applied / algorithmic on the
other, by deliberately developing the basic theory (Part I), the mathematical
modelling (Part II), and the algorithmic / numerical methods (Part III) of solving a given
problem.



Inverse Problems 

So, to begin, why would we want to solve an inverse problem? 

There are a great many spatial phenomena that a person might want to study ... 

• The salinity of the ocean surface as a function of position; 

• The temperature of the atmosphere as a function of position; 

• The height of the grass growing in your back yard, as a function of location; 

• The proportions of oil and water in an oil reservoir. 

In each of these situations, you aren't just handed a map of the spatial process you 
wish to study, rather you have to infer such a map from given measurements. These 
measurements might be a simple function of the spatial process (such as measuring 
the height of the grass using a ruler) or might be complicated nonlinear functions 
(such as microwave spectra for inferring temperature). 

The process by which measurements are generated from the spatial process is normally
relatively straightforward, and is referred to as a forward problem. More difficult,
then, is the inverse problem, discussed in detail in Chapter 2, which represents
the mathematical inverting of the forward problem, allowing you to infer the process
of interest from the measurements. A simple illustration is shown in Figure 1.1.
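
The asymmetry between the two directions can be seen in a toy sketch. The `forward` function below is a hypothetical stand-in for a blurring instrument (it simply averages adjacent pairs of values) and is not taken from this text; the point is only that evaluating the forward problem is trivial, whereas two different underlying processes can produce identical measurements, so the measurements alone cannot tell us which one was present.

```python
def forward(z):
    """Hypothetical forward problem: average each adjacent pair of values."""
    return [(z[k] + z[k + 1]) / 2 for k in range(0, len(z) - 1, 2)]


z_a = [1.0, 3.0, 5.0, 7.0]    # one candidate for the unknown process
z_b = [2.0, 2.0, 6.0, 6.0]    # a different candidate

m_a = forward(z_a)
m_b = forward(z_b)

# Both candidates give the same measurements, so inverting the forward
# problem requires additional information about the process.
assert m_a == m_b
print(m_a)
```

Resolving such ambiguity is where prior knowledge or a statistical model of the process has to enter.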

Large Multidimensional Problems 

So why is it that we wish to study large multidimensional problems? 

The solution to linear inverse problems (see Chapter 3) is easily formulated analyt- 
ically, and even a nonlinear inverse problem can be reformulated as an optimization 
problem and solved. The challenge, then, is not the solving of inverse problems in 
principle, but rather actually solving them in practice. 

For example, the solution to a linear inverse problem involves a matrix inversion. As
the problem is made larger and larger, eventually the matrix becomes computationally
or numerically impossible to invert. However, this is not just an abstract limit:
even a modest two-dimensional problem at a resolution of 1000 × 1000 pixels
contains one million unknowns, which would require the inversion of a one-million
by one-million matrix: completely unfeasible.
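
To put rough numbers on this claim, the arithmetic can be sketched as below. Two assumptions are made which are not stated in the text: each matrix entry is stored in 8 bytes (double precision), and a dense inversion costs on the order of n³ operations. The function name `dense_matrix_cost` is hypothetical.

```python
def dense_matrix_cost(n_unknowns, bytes_per_entry=8):
    """Entries, storage, and rough operation count for a dense n x n matrix.

    Assumptions: 8 bytes per entry, and an order n^3 cost for inversion.
    """
    entries = n_unknowns ** 2
    storage_bytes = entries * bytes_per_entry
    inversion_ops = n_unknowns ** 3
    return entries, storage_bytes, inversion_ops


n = 1000 * 1000                           # unknowns in a 1000 x 1000 image
entries, storage_bytes, ops = dense_matrix_cost(n)

print(f"{n:,} unknowns")
print(f"{entries:,} matrix entries")
print(f"{storage_bytes / 1e12:.0f} TB of storage at 8 bytes per entry")
print(f"on the order of {ops:.0e} operations to invert")
```

Under these assumptions the matrix alone has 10¹² entries and occupies 8 TB, before any computation has begun.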

[Figure 1.1; labels: Measurements, Forward, Inverse]

Fig. 1.1. An inverse problem: You want a nice clear photo of a face, however your camera
yields blurry measurements. To solve this inverse problem requires us to mathematically invert
the forward process of blurring.



Therefore even rather modestly sized two- and higher-dimensional problems become 
impossible to solve using straightforward techniques, yet these problems are very 
common. Problems having one million or more unknowns are littered throughout the 
fields of remote sensing, oceanography, medical imaging, and seismology, to name 
a few. 

To be clear, a problem is considered to be multidimensional if it is a function of 
two or more independent variables. These variables could be spatial (as in a two- 
dimensional image or a three-dimensional volume), spatio-temporal (such as a video, 
a sequence of two-dimensional images over time), or a function of other variables 
under our control. 



Multidimensional Methods versus Image Processing 



What is it that the great diversity of algorithms in the image processing literature 
cannot solve? 

The majority of images which are examined and processed in image processing are 
"real" images, pictures and scenes at human scales, where the images are not well 
described mathematically. Therefore the focus of image processing is on making 
relatively few explicit, mathematical assumptions about the image, and instead fo- 
cusing on the development of algorithms that perform image-related tasks (such as 
compression, segmentation, edge detection, etc.). 

[Figure 1.2; panels: Face, Grass, Clouds]

Fig. 1.2. Which of these might be best characterized mathematically? Many natural phenomena,
when viewed at an appropriate scale, have a behaviour which is sufficiently varied or
irregular that it can be modelled via relatively simple equations, as opposed to a human face,
which would need a rather complex model to be represented accurately.



In contrast, of great research interest are images taken at microscopic scales (cells 
in a Petri dish, the crystal structure of stone or metal) or at macroscopic scales (the 
temperature distribution of the ocean or of the atmosphere, satellite imagery of the 
earth) which do, in general, allow some sort of mathematical characterization, as 
explored in Figure 1.2. That is, the focus of this text is on the assumption or inference
of rather explicit mathematical models of the unknown process. 

Next, in order to be able to say something about a problem, we need measurements 
of it. These measurements normally suffer from one of three issues, any one of which 
would preclude the use of standard image-processing techniques: 



1. For measurements produced by a scientific instrument, acquiring a measurement
normally requires time and/or money, therefore the number of measurements is
constrained. Frequently this implies that the multidimensional problem of interest
is only sparsely sampled, as illustrated in Figure 1.3.

There exist many standard methods to interpolate gaps in a sequence of data; however,
standard interpolation knows nothing about the underlying phenomenon being
studied. That is, surely a grass-like texture should be interpolated differently
from a map of ocean-surface temperature.

2. Most measurements are not exact, but suffer from some degree of noise. Ideally 
we would like to remove this noise, to infer a more precise version of the under- 
lying multidimensional phenomenon. 

There exist many algorithms for noise reduction in images, however these are 
necessarily heuristic, because they are designed to work on photographic images, 
which might contain images of faces / cars / trees and the like. Given a scientific 
dataset, surely we would wish to undertake denoising in a more systematic (ide- 
ally optimal) manner, somehow dependent on the behaviour of the underlying 
phenomenon. 

[Figure 1.3; panels: Satellite Altimetry (each point measures ocean height), showing satellite track locations against longitude (East); Ship-Based Measurements (each point measures ocean temperature), against longitude; Slices from a 3D MRI (each point represents the concentration of water in a block of concrete)]

Fig. 1.3. Multidimensional measurements: Three examples of two- or three-dimensional
measurements which could not be processed by conventional means of image processing. The
altimetric measurements are sparse, following the orbital path of a satellite; the ship-based
measurements are irregular and highly sparse, based on the paths that a ship followed in
towing an instrument array; the MRI measurements are dense, but at poor resolution and with
substantial noise.



3. In many cases of scientific imaging, the raw measurement produced by an in- 
strument is not a direct measurement of the multidimensional field, but rather 
some function of it. For example, in Application 3 we wish to study atmospheric 
temperature based on radiometric measurements of microwave intensities: the air 
temperature and microwave intensity are indeed related, but are very different 
quantities. 

Standard methods in image processing normally assume that the measurements 
(possibly noisy, possibly blurred) form an image. However, having measurements 
being some complicated function of the field of interest (an inverse problem) is 
more subtle and requires a careful formulation. 
Statistics and Random Fields

What is it that makes a problem statistical, and why do we choose to focus on statis- 
tical methods? 

An interest in spatial statistics goes considerably beyond the modelling of phenom- 
ena which are inherently random. In particular, multidimensional random fields offer 
the following advantages: 

1. Even if an underlying process is not random, in most cases measurements of the
process are corrupted by noise, and therefore a statistical representation may be 
appropriate. 

2. Many processes exhibit a degree of irregularity or complexity that would be 
extremely difficult to model deterministically. Two examples are shown in Fig- 
ure 1.4; although there are physics which govern the behaviour of both of these 
examples (e.g., the Navier-Stokes differential equation for water flow) the models 
are typically highly complex, containing a great number of unknown parameters, 
and are computationally difficult to simulate. 

A random-fields approach, on the other hand, would implicitly approximate these 
complex models on the basis of observed statistics. 

A random field¹ X is nothing but a large collection of random variables arranged on
some set of points (possibly a two- or three-dimensional grid, perhaps on a sphere, 
or perhaps irregularly distributed in a high-dimensional space). The random field is 
characterized by the statistical interrelationships between its random variables. 

The main problem associated with a statistical formulation is the computational com- 
plexity of the resulting solution. However, as we shall see, there exists a compre- 
hensive set of methods and algorithms for the manipulation and efficient solving of 
problems involving random fields. The development of this theory and of associated 
algorithms is the fundamental goal of this text. 

Specifically, the key problem explored in this text is representational and computational
efficiency in the solving of large problems. The question of efficiency is easily
motivated: even a very modestly sized 256 × 256 image has 65 536 elements, and
the glass beads image in Figure 1.4 contains in excess of 100 million elements! It
comes as no surprise that a great part of the research into random fields involves the
discovery or definition of implicit statistical forms which lead to effective or faithful
representations of the true statistics, while admitting computationally efficient
algorithms.
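
As one hedged illustration of what an implicit form can buy, compare the number of values needed to describe the second-order statistics of an image in two ways. The sketch assumes, purely for illustration, that the field is spatially stationary, so that the covariance between two pixels depends only on their offset; it then counts the entries of the full (dense) covariance matrix against the number of distinct offsets. The function names are hypothetical.

```python
def dense_covariance_entries(rows, cols):
    """Entries in the full covariance matrix of a rows x cols random field."""
    n = rows * cols
    return n * n


def stationary_kernel_entries(rows, cols):
    """Distinct pixel offsets on a rows x cols grid.

    Assumption: under stationarity the covariance depends only on the offset,
    which ranges over -(rows-1)..(rows-1) and -(cols-1)..(cols-1).
    """
    return (2 * rows - 1) * (2 * cols - 1)


dense = dense_covariance_entries(256, 256)
kernel = stationary_kernel_entries(256, 256)

print(f"dense covariance:  {dense:,} entries")
print(f"stationary kernel: {kernel:,} entries")
```

The dense form needs over four billion values for a 256 × 256 image, the stationary form a few hundred thousand; whether stationarity is a faithful description of a given problem is, of course, a separate modelling question.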

Broadly speaking there are four typical problems associated with random fields 
[112]: 



¹ Random variables, random vectors, and random fields are reviewed in Appendix B.1.

[Figure 1.4; panels: A Porous Medium of Packed Glass Beads (microscopic data from M. Ioannidis, Dept. Chemical Engineering, University of Waterloo); Global Ocean Surface Temperature, plotted against longitude]

Fig. 1.4. Two examples of phenomena which may be modelled via random fields: packed
glass beads (top), and the ocean surface temperature (bottom). Alternatives to random fields
do exist to model these phenomena, such as ballistics methods for the glass beads, and coupled
differential equations for the ocean, however such approaches would be greatly more complex
than approximating the observed phenomena on the basis of inferred spatial statistics.



1. Representation: how is the random field represented and parametrized? 

2. Synthesis: how can we generate "typical" realizations of the random field? 

3. Parameter estimation: given a parametrized statistical model and sample image, 
how can we estimate the unknown parameters in the model? 
4. Random fields estimation: given noisy observations of the random field, how can
the unknown random field be estimated? 

All four of these issues are of interest to us, and are developed throughout the text. 

For each of these there are separate questions of formulation, 

How do I write down the equations that need to be solved? 
as opposed to those of solution, 

How do I actually find a solution to these equations? 

Part I of this text focuses mostly on the former question, establishing the mathemat- 
ical fundamentals that are needed to express a solution, in principle. This gives us a 
solution which we might call 

1. Brute Force: The direct implementation of the solution equations, irrespective 
of computational storage, complexity, and numerical robustness issues. 

Parts II and III then examine the latter question, seeking practical, elegant, or indirect
solutions to the problems of interest. However, practical should not be interpreted to
mean that the material is only of dry interest to the specialist sitting at a computer,
about to develop a computer program. Many of the most fundamental ideas expressed
in this text are found in Part II in particular, where deep insights into the nature of
spatial random fields are explored.

A few kinds of efficient solutions, alternatives to the direct implementations from 
Part I, are summarized as follows: 

2. Dimensionality Reduction: Transforming a problem into one or more lower- 
dimensional problems. 

3. Change of Basis: A mathematical transformation of the problem which simpli- 
fies its computational or numerical complexity. 

4. Approximate Solution: An approximation to the exact analytical solution. 

5. Approximated Problem: Rather than solving the given problem, identifying a 
similar problem which can be solved exactly. 

6. Special Cases: Circumstances in which the statistics or symmetry of the problem 
gives rise to special, efficient solutions. 

These six points give a broad sense of what this text is about. 




Interpolation as a Multidimensional Statistical Problem 

We conclude the Introduction by developing a simple canonical example, to which 
we frequently refer throughout the text. We have chosen this problem because it 
is intuitive and simple to understand, yet possesses most of the features of a large, 
challenging, estimation problem. 

Suppose you had sparse, three-dimensional measurements of some scalar quantity, 
such as the temperature throughout some part of an ocean. You wish to produce a 
dense map (really, a volume) of the temperature, based on the observed measure- 
ments. 

Essentially this is an interpolation problem, in that we wish to take sparse mea- 
surements of temperature, and infer from them a dense grid of temperature values. 
However by interpolation we do not mean standard deterministic approaches such 
as linear, bilinear, or B-spline interpolation, in which a given set of points is deter- 
ministically projected onto a finer grid. Rather, we mean the statistical problem, in 
which we have a three-dimensional random field Z with associated measurements 
M, where the measurements are subject to noise V, such that 

$$ m_i = z_{j_i} + v_i, \tag{1.1} $$

where $j_i$ is an index describing the location of the $i$th measurement. Thus (1.1) gives
the forward model, which we wish to invert (Chapter 2).
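
The forward model (1.1) is easy to simulate, which is worth seeing once before turning to its inversion. The sketch below makes several assumptions that are not part of the text: the field is stored as a flat list (as if a three-dimensional grid had been reordered into a vector), the "true" field values are invented, and the noise is taken to be zero-mean Gaussian. The function name `simulate_measurements` is hypothetical.

```python
import random


def simulate_measurements(z, locations, noise_std, rng):
    """Return m_i = z[j_i] + v_i for each measurement location j_i.

    Assumption: the noise v_i is independent, zero-mean Gaussian.
    """
    return [z[j] + rng.gauss(0.0, noise_std) for j in locations]


rng = random.Random(0)                        # fixed seed, for repeatability
z = [10.0 + 0.1 * k for k in range(20)]       # invented "true" field on 20 points
locations = [2, 7, 15]                        # the indices j_i: 3 of 20 points observed

m = simulate_measurements(z, locations, noise_std=0.5, rng=rng)
print(m)
```

Only three of the twenty field values are observed, and each observation is perturbed by noise; the inverse problem is to recover all twenty from these three.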

Given the definition of the inverse problem, we can formulate the analytical solution, 
depending on whether this is a static problem, a single snapshot in time (Chapter 3), 
or a more complicated time-dynamic problem, in which the temperature evolves and 
is estimated over time (Chapter 4). 

However, so far we haven't said anything about the mathematics or statistics gov- 
erning Z. What distinguishes statistical interpolation from deterministic methods, 
such as linear or bilinear interpolation, is the ability to take into account specific 
properties of Z (Chapter 5). Thus is Z smooth, on what length scales does it exhibit 
variability, and what happens at its boundaries? Furthermore, are the statistics of Z 
spatially stationary (not varying from one location to another) or not, and are the 
statistics best characterized by looking at correlations of Z or at the inverse correla- 
tions (Chapter 6)? Finally are there hidden underlying aspects to the problem, such 
that the model in one location may be different from that in another (Chapter 7)?
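
A small one-dimensional sketch may make the contrast concrete. It estimates the field at a point lying between two noisy measurements, using a linear estimator built from an assumed prior covariance. Everything specific here is an assumption for illustration rather than the method of this text: a zero-mean prior, an exponential covariance with a chosen correlation length, independent measurement noise, and the hypothetical function names `exp_cov` and `estimate_between`.

```python
import math


def exp_cov(d, variance, corr_length):
    """Assumed prior covariance between two points separated by distance d."""
    return variance * math.exp(-abs(d) / corr_length)


def estimate_between(x, x1, m1, x2, m2, variance, corr_length, noise_var):
    """Linear estimate of a zero-mean field at x from measurements at x1 and x2."""
    # 2 x 2 measurement covariance [[a, b], [b, a]]: prior plus noise on the diagonal
    a = variance + noise_var
    b = exp_cov(x1 - x2, variance, corr_length)
    # Cross-covariance between the field at x and each measurement
    c1 = exp_cov(x - x1, variance, corr_length)
    c2 = exp_cov(x - x2, variance, corr_length)
    # Weights [c1, c2] times the inverse of [[a, b], [b, a]]
    det = a * a - b * b
    w1 = (c1 * a - c2 * b) / det
    w2 = (c2 * a - c1 * b) / det
    return w1 * m1 + w2 * m2


# Two measurements of value 1.0, ten units apart; estimate at the midpoint.
smooth = estimate_between(5.0, 0.0, 1.0, 10.0, 1.0,
                          variance=1.0, corr_length=100.0, noise_var=0.1)
rough = estimate_between(5.0, 0.0, 1.0, 10.0, 1.0,
                         variance=1.0, corr_length=1.0, noise_var=0.1)
print(smooth, rough)
```

Linear interpolation would return 1.0 at the midpoint in both cases. The statistical estimate stays close to the measurements when the field is assumed to vary slowly, but falls back towards the prior mean of zero when the assumed correlation length is much shorter than the gap between measurements.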

If the problem is particularly large, would it be possible to collapse it along one 
dimension, or possibly to solve the problem in pieces, rather than as a whole? One 
could also imagine transforming the problem, for example using a Fourier or wavelet 
transform (Chapter 8). 

At this point we have determined what sort of problem we have, whether reduced in
dimensionality, whether transformed, whether stationary. We are left with two basic
approaches for solving the inverse problem: we can convert the inverse problem to
a linear system, and use one of a number of linear systems solvers (mostly iterative)
to find the desired map (Chapter 9), or we could use a domain-decomposition approach
that tackles the problem row-by-row, column-by-column, block-by-block, or
scale-by-scale (Chapter 10). We may also wish to understand the model better by
generating random samples from it (Chapter 11).



How to Read This Text 

The preceding interpolation example has been very short and many details are omit- 
ted, but it is hoped that it gives the reader a sense of the scope of the ideas developed 
in this text. The reader wishing to follow up on interpolation in more detail is encour- 
aged to move directly to the three interpolation examples developed in Chapter 2 on 
pages 20, 32, and 36. 

Those readers unfamiliar with the contents of this text may wish to survey the book 
by glancing through the worked applications at the end of every chapter, which cover 
a variety of topics in remote sensing and scientific imaging. These applications, and 
also the various examples throughout the text, are all listed beginning on page XIII. 

Any reader who wishes to explore multidimensional random fields and processes 
in some depth should focus on the chapters on inverse problems and modelling, 
Chapters 2, 5, 6, and 8, which form the core of this text. 

Readers who are interested in numerical implementations of the methods in this text 
should consult the list of Matlab 2 code samples on page XV. The code samples 
are cross-referenced to figures and examples throughout the text, and all of the listed 
samples are available online at 

http://ocho.uwaterloo.ca/book



2 Matlab is a registered trademark of The MathWorks Inc.
Inverse Problems and Estimation



Inverse Problems 



An understanding of forward and inverse problems [12, 301] lies at the heart of any 
large estimation problem. 

Abstractly, most physical systems can be defined or parametrized in terms of a set 
of attributes, or unknowns, from which other attributes, or measurements, can be 
inferred. In other words, the quantities m which we measure are some mathematical 
function 

m = f(z) (2.1) 

of other, more basic, underlying quantities z, where f may be deterministic or
stochastic. In the special case when f is linear, a case of considerable interest to
us, then (2.1) may be expressed as 

    m = Cz    or    m = Cz + v    (2.2)

for the deterministic or stochastic cases, respectively. Normally z is an ideal, com- 
plete representation of the system: detailed, noise-free, and regularly structured (e.g., 
pixellated), whereas the measurements m are incomplete and approximate: possibly 
noise-corrupted, irregularly structured, limited in number, or somehow limited by 
the physics of the measuring device. 

The task of inferring or computing the measurements m from the detailed funda- 
mental quantities z is known as the forward problem. This is, by definition, an easy 
problem, since the relationship between the fundamental and measured quantities 
is given by some model f(). For example, knowing z (the exact shape, size, and
arrangement of all blood vessels, organs, bones, etc. in the body) makes it fairly
easy to predict the appearance of m, a measured X-ray [183]. A variety of further 
examples is shown in Table 2.1. 

For deterministic systems, the algorithmic task of the forward problem is referred to 
as "simulation;" for the stochastic counterparts the task is known as "sample-path 
generation" or "sampling." 
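As a minimal sketch of the linear forward problem (2.2), the following Python/NumPy fragment computes both a deterministic "simulation" and a stochastic "sample." The matrix C, the vector z, and the noise level are hypothetical values chosen only for illustration; they are not taken from the text or from its Matlab samples.

```python
import numpy as np

def forward(C, z, noise_std=0.0, rng=None):
    """Linear forward problem of (2.2): m = C z, optionally plus noise v."""
    m = C @ z
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        m = m + noise_std * rng.standard_normal(m.shape)
    return m

# Hypothetical unknowns and measurement structure (assumptions for illustration).
z = np.array([1.0, 2.0, 3.0, 4.0])
C = np.array([[1.0, 0.0, 0.0, 0.0],    # measures the first element directly
              [0.0, 0.5, 0.5, 0.0]])   # measures an average of two elements

m_sim = forward(C, z)                                                  # deterministic: "simulation"
m_sample = forward(C, z, noise_std=0.1, rng=np.random.default_rng(0))  # stochastic: "sampling"
```

Either way the computation is a single matrix-vector product, which is the sense in which the forward problem is easy.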



Knowing . . .                               It is easy to determine

The arrangement and sizes of all            The appearance of a measured
organs, bones, blood vessels etc.           X-ray, MRI, CAT scan etc.

The known masses and positions              The shape of the cluster's
of stars in a cluster                       gravitational field

The 3D density of rock, magma,              The gravitationally induced
and minerals below the ocean floor          shape of the surface of the ocean

The underground distribution                Groundwater and other pollutant
and layering of rock, clay, sand            flows

A sharp image, in focus                     A blurry image, out of focus

Table 2.1. Examples of forward problems: Generally, knowing some regular, well-structured,
fundamental set of quantities (left) allows derived quantities (right) to be inferred relatively
easily.



Fundamental                 Forward Problem: Easy       Derived
Regularly Structured        -------------------->       Typically Irregular
Complete, Accurate          <--------------------       Incomplete, Approximate
                            Inverse Problem: Hard

Table 2.2. An inverse problem is, by definition, the difficult inversion of a
comparatively-straightforward forward problem. Inverse problems are common because the
available measurements in any given problem normally have the attributes on the right,
whereas what we want are those on the left.
The task of inferring the reverse, that is, inferring the fundamental quantities z from
the derived or measured ones m, is known as an inverse problem. Since inferring 
unknowns from measurements is a universal objective in experimental science, the 
solution of inverse problems is of vast interest and well-discussed in the literature 
[24, 27, 129, 140, 154, 189, 298, 304, 308-310]. It is also a hard problem, for two 
reasons: 

1. The inverse function

    z = f^{-1}(m)    (2.3)

is normally not known, explicitly, from the mathematics or physics of the problem.
The relationship from measurements to fundamental quantities is known
only implicitly, indirectly, through the forward problem. Continuing the previous
example, it takes considerable practice and skill to reconstruct z, the three-dimensional
anatomy of a patient, from m, a sequence of observed X-ray images.

If the forward problem is, indeed, invertible, then the inverse can be characterized 
as 

    z = f^{-1}(m) \triangleq \{ z \mid f(z) = m \},    (2.4)

that is, to try all possible values of z to find the one which produced the observed
measurements m. Keep in mind that for an n × n eight-bit greyscale image, the
number of possible configurations for z is enormous:

    |\{z\}| = (2^8)^{(n^2)}.    (2.5)

It is important to note, however, that some inverse problems can be solved by 
the repeated application of a forward problem, such as the patch-based methods 
in Section 11.4, or in simple cases where a sequence z_i can be found such that
f(z_i) → m, such as in Example 2.1.

2. Normally the quantity or quality of the measurements is inadequate to allow a 
reconstruction of z, implying that the inverse function f^{-1} does not even exist.

For example, a single X-ray image, as sketched in Figure 2.1, cannot tell a physi- 
cian whether a bone lies towards the front or the back of a body — both scenarios 
result in the same measurement. 
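Both difficulties can be seen in a toy version of the projection problem. The sketch below applies the exhaustive search of (2.4) to a hypothetical 2 × 2 binary image whose only measurement is its vertical projection (the column sums). The image size, the binary grey levels, and the function names are assumptions made for illustration.

```python
import itertools
import numpy as np

def forward(z):
    """Vertical projection: sum each column of the image z."""
    return z.sum(axis=0)

def brute_force_inverse(m, shape, levels=2):
    """Try every possible image, as in (2.4), keeping those with f(z) = m."""
    n_pix = shape[0] * shape[1]
    solutions = []
    for values in itertools.product(range(levels), repeat=n_pix):
        z = np.array(values).reshape(shape)
        if np.array_equal(forward(z), m):
            solutions.append(z)
    return solutions

m = np.array([1, 1])                       # observed projection
solutions = brute_force_inverse(m, (2, 2)) # several images explain the same m

# Size of the search space from (2.5) for a 16 x 16 eight-bit image.
n_configs_8bit = (2 ** 8) ** (16 ** 2)
```

Even for this tiny case more than one image reproduces the measurement, and the count from (2.5) shows why exhaustive search is hopeless for images of realistic size.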

Given the basic structure of an inverse problem, as outlined above, there are four 
questions to discuss: 

1. Can we generalize this idea of forward and inverse problems to more heterogeneous
problems?

2. When is the inverse problem "well-posed"? That is, does a solution exist? 

3. When is the inverse problem "well-conditioned"? That is, will we find a mean- 
ingful solution, given a limited numerical accuracy in our computations? 
Example 2.1: Root Finding is an Inverse Problem



Given a nonlinear function 

m = f(z), 

finding the roots of f() is just an inverse problem:

    z = f^{-1}(0)    or    Find z such that f(z) = m = 0.    (2.6)



For this type of problem it is very common, in fact, for the inverse problem to be 
solved by repeated application of the forward problem. 

For example, the bisection method of numerical root finding examines the sign of 
f(z) for various z, starting with two points z_1, z_2 such that f(z_1) · f(z_2) < 0:



[Figure: plot of m = f(z) crossing zero more than once, with two possible starting
brackets (z_1, z_2) and (z_1', z_2').]





For this inverse problem uniqueness fails (there are multiple solutions); therefore
the numerical solution will be a function of initialization: starting at (z_1, z_2)
will yield a different solution from starting at (z_1', z_2').
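The bisection idea in this example can be sketched in a few lines of Python. The cubic f used here is a hypothetical test function with three roots, chosen only to show that the answer depends on the starting bracket; the tolerance and iteration limit are likewise illustrative assumptions.

```python
def bisect(f, z_lo, z_hi, tol=1e-10, max_iter=200):
    """Find a root of f by repeated forward evaluations on a shrinking bracket."""
    f_lo, f_hi = f(z_lo), f(z_hi)
    if f_lo * f_hi >= 0:
        raise ValueError("f must change sign over the starting bracket")
    for _ in range(max_iter):
        z_mid = 0.5 * (z_lo + z_hi)
        f_mid = f(z_mid)
        if f_mid == 0.0 or (z_hi - z_lo) < tol:
            return z_mid
        if f_lo * f_mid < 0:
            z_hi, f_hi = z_mid, f_mid   # the sign change lies in the lower half
        else:
            z_lo, f_lo = z_mid, f_mid   # the sign change lies in the upper half
    return 0.5 * (z_lo + z_hi)

def f(z):
    """Hypothetical forward model with roots at -1, 0, and 1."""
    return z ** 3 - z

root_a = bisect(f, -1.7, -0.4)   # this bracket contains only the root near -1
root_b = bisect(f, 0.3, 1.6)     # this bracket contains only the root near +1
```

Only the forward function f is ever evaluated; the inverse is never written down, and the two brackets lead to two different, equally valid solutions.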



4. Given a poorly-posed or poorly-conditioned inverse problem, how do we regu- 
larize it to allow a solution to be found? 

The following four sections discuss each of these, in sequence. 



2.1 Data Fusion 



Data fusion means many different things to different people and research disciplines. 
Invariably it implies some sort of fusion, or combining, of different pieces of infor- 
mation to infer some related quantity. 

In many contexts data fusion is associated with theories of evidence and belief,
especially Dempster-Shafer theory [283, 344], which seeks to separate notions of
probability (which indicates a degree of likelihood) and belief (indicating the degree
to which we are confident in the truth of some statement).

[Figure: two hidden structures; their vertical projections give the available
measurements, Projection m_1 and Projection m_2.]

Fig. 2.1. The ambiguity of an inverse problem: One-dimensional projections (essentially line
integrals) m_1, m_2 are observed for two different two-dimensional structures z_1, z_2. Although
the two-dimensional structures are clearly very different, they result in identical projections
(although clearly a sideways projection would reveal the differences!).

In this book we do not use Dempster-Shafer theory or notions of belief; rather, we 
use data fusion to mean the use of multiple, different measurement sets, such as 
shown in Figure 2.2, in solving an inverse problem. Such problems occur widely in 
multisensor scenarios: 

• Remote sensing based on data from multiple satellite platforms 

• Multispectral data processing 

• Visual and infrared computer vision 

• Vision, acoustic, and tactile sensors in robotics 

The mathematical formulation of such a problem is no different from before: the 
multiple measurements 

    m_i = f_i(z),    i = 1, ..., k    (2.7)



are equivalent to a single stacked set of measurements

    m = \begin{bmatrix} m_1 \\ \vdots \\ m_k \end{bmatrix}
      = \begin{bmatrix} f_1(z) \\ \vdots \\ f_k(z) \end{bmatrix} = f(z)    (2.8)

as in (2.1).

Fig. 2.2. Data fusion means the combining of multiple datasets, a problem encountered very
frequently in medical imaging, remote sensing, and scientific image processing. Here a single
soil sample is viewed in different polarizations, such that each image reveals different
structures or attributes.



The reason that studying data fusion is of practical interest is that the algorithmic 
approach to the stacked problem (2.8) may be considerably more difficult than the 
individual problems of (2.7). Specifically, the tractability of the individual problems 
of (2.7) may rely on specific properties of f_i, such as locality, sparsity, or stationarity,
properties which may be lost in the combined setting. 
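For linear forward models the stacking in (2.8) is simply a vertical concatenation of the individual measurement matrices. The following sketch uses two hypothetical sensors, each observing a different half of a four-element z; the matrices and values are assumptions chosen to keep the example small.

```python
import numpy as np

z_true = np.array([2.0, -1.0, 0.5, 3.0])

# Hypothetical sensors: sensor 1 sees the first two elements, sensor 2 the last two.
C1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])
C2 = np.array([[0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])
m1, m2 = C1 @ z_true, C2 @ z_true

# Stacked problem in the sense of (2.8): one matrix, one measurement vector.
C = np.vstack([C1, C2])
m = np.concatenate([m1, m2])

rank_individual = (int(np.linalg.matrix_rank(C1)), int(np.linalg.matrix_rank(C2)))
rank_stacked = int(np.linalg.matrix_rank(C))
z_hat = np.linalg.solve(C, m)   # only the fused problem determines all of z
```

This shows only the mechanics of stacking. In general the stacked matrix need not inherit structure (such as a banded or shift-invariant form) that each individual matrix may have had, which is where the algorithmic difficulty arises.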

To illustrate this point, consider the following examples from image processing: 

Forward Operation    Given Measurements          Problem Solution (Inversion)

Blurring             Blurry Image                Deconvolution
Added Noise          Noisy Image                 Denoising
Both                 Some blurred images,        ???
                     some noisy images

That is, standard methods in image processing (Appendix C), such as deconvolution 
or denoising, do not extrapolate to mixed, nonstationary, heterogeneous settings.
Although it is not possible to formulate a single computationally efficient generalized
approach to heterogeneous inverse problems, we are nevertheless motivated to find 
approaches to such data fusion problems. 

In the following sections we do not rely on specific properties of the forward problem 
f; we will keep the data fusion approach in mind, and will think of f as referring to
the heterogeneous case of (2.8). 



2.2 Posedness 



The question of posedness is one of invertibility. That is, does (2.1) admit a definition 
of the inverse function f^{-1}? In the early 1900s, Hadamard [154] formulated the
following three criteria to be satisfied in order for a problem to be well-posed: 

Existence: Every observation m has at least one corresponding value of z. 
Uniqueness: For every observation m, the solution for z is unique. 
Continuity: The dependence of the solution z on m is continuous. 

We can interpret the above in the context of our linear forward problem m = Cz
from (2.2). Suppose that C is k × n (the reader is pointed to Appendix A.1 if the
matrix terminology used here is unfamiliar):

i. If C does not have full row rank (always true if k > n)
   → The rows of C are linearly dependent
   → Thus Ra(C) ⊂ R^k
   → Therefore there exists m ∈ R^k such that m ∉ Ra(C)
   → Existence fails.

ii. If C has full row rank (requires k ≤ n)
   → Ra(C) = R^k
   → Existence is satisfied.

iii. If C does not have full column rank (always true if k < n)
   → The columns of C are linearly dependent
   → Thus Nu(C) ≠ {0}
   → There exists x ∈ Nu(C) such that x ≠ 0
   → It follows that Cz = C(z + x)
   → Uniqueness fails.

iv. If C has full column rank (requires k ≥ n)
   → Nu(C) = {0}
   → Uniqueness is satisfied.
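Example 2.2: Interpolation and Posedness

Consider an inverse problem, characterized by the following forward model:

[Figure: five unknowns z_1, ..., z_5 along a line; the forward model m = Cz observes
only the two end points, giving measurements m_1 and m_2.]

Standard linear interpolation would produce a straight-line sequence . . .

[Figure: estimates lying on the straight line joining m_1 and m_2.]

But there is actually nothing inherent in the forward model to allow us to claim
this kind of straight-line behaviour. Algebraically, we would write the forward
model as

    m = Cz    or    \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}
                  = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}
                    \begin{bmatrix} z_1 \\ \vdots \\ z_5 \end{bmatrix}    (2.9)

Clearly Nu(C) ≠ {0} since, for example, z_2 is not measured:

    C \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \end{bmatrix}^T = 0.    (2.10)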



Therefore uniqueness fails, and the problem of interpolation is ill-posed. 



v. If C has full row and column rank (requires k = n)
   → Then C is invertible
   → z = C^{-1} m offers a single solution for z (uniqueness),
     for every value of m (existence),
     and is a continuous function of m (continuity)
   → Therefore the problem is well-posed.
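The rank conditions in cases i to v can be checked numerically. The sketch below is one possible way of doing so in Python/NumPy; the function name `posedness` is hypothetical, and `numpy.linalg.matrix_rank` decides rank using a numerical tolerance, so the answer is itself subject to the conditioning issues discussed in Section 2.3.

```python
import numpy as np

def posedness(C):
    """Report existence and uniqueness for m = Cz from the rank of the k x n matrix C."""
    k, n = C.shape
    r = np.linalg.matrix_rank(C)
    return {"existence": bool(r == k),    # full row rank
            "uniqueness": bool(r == n)}   # full column rank

# Interpolation problem of Example 2.2: only the end points are measured.
C_interp = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 1.0]])

# Three redundant measurements of a single unknown.
C_redundant = np.ones((3, 1))

# A square, invertible case.
C_square = np.eye(3)

report = {name: posedness(C) for name, C in
          (("interp", C_interp), ("redundant", C_redundant), ("square", C_square))}
```

Continuity is automatic in the linear, invertible case, which is why only the two rank tests appear here.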
The overwhelming majority of image-analysis¹ problems are ill-posed; three simple
examples are illustrated in Figure 2.3: 

Image Blurring: Although the blurred image may seem highly distorted from the 
original, if the exact nature of the blurring operation is known then it can, under 
certain circumstances,² be inverted perfectly. This problem may be well-posed.

Image Subsampling: Here the problem is to interpolate the missing portions of
the image. However, with pieces of the image z not observed, uniqueness fails and 
the problem is ill-posed. 

Image Noise: Although the original image is clearly discernible through the noise, 
the original z is unknown and uniqueness fails. 

Two further examples illustrate failed continuity and existence: 

Edge Detection: Suppose we wish to estimate the edge process z(x) underlying 
an observed signal m(x) [27]. We propose the inverse function

    z(x) = \frac{\partial m}{\partial x}.    (2.11)

If we consider two sets of observations

    m = g(x)    and    m' = g(x) + \epsilon \sin(\Omega x),    (2.12)

then m and m' can be arbitrarily close for small ε, however the resulting edge
maps z, z' can be made arbitrarily far apart for large Ω, thus continuity fails.
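The failure of continuity can be seen numerically. The sketch below assumes that the edge detector is a derivative with respect to x, approximated by finite differences, and uses a hypothetical smooth step for the underlying signal; the values of ε and Ω are illustrative.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20001)
g = np.tanh(20.0 * (x - 0.5))      # hypothetical underlying signal (a smooth step)
eps = 1e-3

def edge_map(m, x):
    """Assumed edge detector: finite-difference derivative of m with respect to x."""
    return np.gradient(m, x)

gaps = {}
for omega in (1e1, 1e2, 1e3):
    m_prime = g + eps * np.sin(omega * x)
    data_gap = np.max(np.abs(m_prime - g))                             # never exceeds eps
    edge_gap = np.max(np.abs(edge_map(m_prime, x) - edge_map(g, x)))   # roughly eps * omega
    gaps[omega] = (data_gap, edge_gap)
```

The two observations stay within ε of each other for every Ω, while the gap between the two edge maps grows in proportion to Ω.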

Redundant Measurements: Suppose that we take multiple (redundant) mea- 
surements, for example, to reduce measurement uncertainty or noise. The nominal 
forward model

    \begin{bmatrix} m_1 \\ \vdots \\ m_n \end{bmatrix}
      = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} z    (2.13)

is ill-posed, because unless all of the measurements m_1, ..., m_n are identical,
existence fails for z.

Ill-posedness is not necessarily a terrible difficulty. Indeed, there are very simple, 
common problems, such as the linear regression illustrated in Example 2.3, which 
are ill-posed but for which solutions can be proposed. In general: 



1 Appendix C provides a review of image processing. 

2 Actually, the inversion is possible only for certain types of blur, and then only on an infinite 
domain, a finite domain with known boundary conditions, or on a periodic domain with a 
blur based on circular convolution. 
Fig. 2.3. Three simple forward operators from image processing: Blurring f_1, subsampling f_2,
and additive noise f_3 (also see Appendix C). All three forward operations lead to
differently-degraded images (centre column). Methods of deblurring, in-painting, and denoising
are common in the image processing literature, however formally f_2 is ill-posed and f_1, f_3 may or may
not be well-posed, depending on the type of image blur and noise, respectively. The question 
of image deblurring is further examined in the context of problem conditioning, in Figure 2.4, 
and the question of image denoising in the context of regularization, in Figure 2.5. 



If Existence Fails: There is no z corresponding to the given m: the problem is
overdetermined. We view m to have been perturbed by noise from its ideal value,
so we choose that z which comes closest:

    Select an estimate ẑ by minimizing ‖m − Cz‖ for some norm ‖·‖.

If Uniqueness Fails: For the given observation m there are infinitely many possible
solutions for z: the problem is underdetermined. We somehow need additional
information on z, such as the regularization constraints and prior models discussed
later in this chapter. A simple assertion is to select a small value for z:

    Select an estimate ẑ by minimizing ‖z‖ over those z satisfying m = Cz.
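Both selection rules are available in standard numerical libraries when the norm is taken to be the Euclidean norm (an assumption; the text leaves the norm open). The following sketch uses `numpy.linalg.lstsq` for the overdetermined case and the pseudoinverse for the underdetermined case, with hypothetical measurement values.

```python
import numpy as np

# Existence fails: three redundant measurements of one scalar, as in (2.13).
C_over = np.ones((3, 1))
m_over = np.array([1.9, 2.0, 2.4])
z_ls, *_ = np.linalg.lstsq(C_over, m_over, rcond=None)   # minimizes ||m - Cz||_2

# Uniqueness fails: only the end points of a five-element z are measured.
C_under = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 0.0, 1.0]])
m_under = np.array([3.0, 5.0])
z_min = np.linalg.pinv(C_under) @ m_under                # smallest ||z||_2 with m = Cz
```

In the first case the estimate is the average of the measurements. In the second the unmeasured elements are simply set to zero, which already hints that the smallest z is rarely the most meaningful choice.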
Example 2.3: Regression and Posedness

In doing linear regression, we assert a model

    y = ax + b.

Thus we have unknowns

    z = \begin{bmatrix} a \\ b \end{bmatrix}.

[Figure: plots of the given data points (x_i, y_i).]

In a typical linear regression we are given many n ≫ 2 data points, but have only
two unknowns a, b. This is essentially an example of repeated measurements as in
(2.13), therefore existence fails and the problem is ill-posed.

This is easily seen in the second plot: no straight line exists which passes
through all of the given points.

However, we never expected the given data points to lie in a perfect line; instead,
we seek an estimate

    \hat{z} = \begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix}

which passes as closely as possible to the given points, minimizing

    \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2.
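A minimal sketch of this regression as a linear inverse problem follows. The data are synthetic, generated from a hypothetical line with added Gaussian noise, and each row of the measurement matrix C is [x_i, 1] so that Cz reproduces a x_i + b.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
a_true, b_true = 2.0, -1.0                      # hypothetical ground truth
y = a_true * x + b_true + 0.5 * rng.standard_normal(x.size)

C = np.column_stack([x, np.ones_like(x)])       # n x 2 measurement matrix, rows [x_i, 1]
(a_hat, b_hat), _, rank, _ = np.linalg.lstsq(C, y, rcond=None)

cost_fit = np.sum((y - (a_hat * x + b_hat)) ** 2)
cost_true = np.sum((y - (a_true * x + b_true)) ** 2)
```

No line passes through all fifty points, yet the least-squares estimate is well defined; by construction its cost cannot exceed that of any other line, including the one that generated the data.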



2.3 Conditioning 



A problem f() is said to be well-posed if existence, uniqueness, and continuity are
satisfied. Well-posedness is a sufficient condition for the mathematical inverse f^{-1}()
to exist. 

However, for the mathematical inverse to exist theoretically does not imply that it 
can be reliably found, in practice, using a computer with finite numerical accuracy. 
That is, the principal limitation in the definition of posedness is that a well-posed 
problem is not necessarily robust with respect to noise or numerical rounding errors. 

[Figure panels: Blurred Image; Deblur - Nonperiodic; Deblur - Periodic.]

Fig. 2.4. The image deblurring problem, following up on Figure 2.3. The deblurring problem is
very ill-conditioned, since blurring removes high-frequency details, and deblurring takes
differences and multiplies them by a large number in order to reconstruct those details. Inasmuch
as the image is not known outside of its boundaries, reconstruction errors appear first at the
boundaries (middle). In the less realistic periodic case, right, both the blurring and deblurring
assume the image to be periodic.

The key question in conditioning [27, 313] is the numerical precision required to
produce a meaningful result, which is essentially dependent on the relative magnitudes
of values which need to be added or subtracted from one another. For example, the
expression

    [(10^{20} + 10^{-20}) - 10^{20}] \cdot 10^{20}    (2.14)

which obviously equals 10^0, is extremely difficult to compute numerically,³ unless it
is first simplified, yet operations such as this are very typical in working with
poorly-conditioned matrices.
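Expression (2.14) is easy to try in any language using double-precision arithmetic. The sketch below uses Python floats, which are IEEE 754 doubles on common platforms.

```python
# Evaluate (2.14) as written: the small term is lost when added to 1e20.
naive = ((1e20 + 1e-20) - 1e20) * 1e20

# Evaluate after cancelling the 10^20 terms by hand.
simplified = 1e-20 * 1e20
```

A double carries roughly sixteen significant decimal digits, so a term forty orders of magnitude smaller than its neighbour has no effect on the sum.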

In the context of our canonical system m = Cz, well-posedness is determined as 

    Well Posed ⟺ C is invertible ⟺ det(C) ≠ 0.    (2.17)

However the numerical computation of det(C) is susceptible to rounding errors: 



3 Matlab returns an answer of 0.
Example 2.4: Measurement Models and Conditioning

The degree to which measurements are blurred is a rough measure of condition
number κ. In the following example five measurement structures are shown, ranging
from separate measurements of each individual state element (easy) to highly
overlapping, nearly redundant measurements (hard).

[Figure: one column of panels per measurement structure (including Delta, Wide Exp.,
Gauss., and Wide Gaus.), showing the rows of C, the condition number κ(C), and the
estimates ẑ ± σ.]

Each panel in the second row shows seven curves, with each curve corresponding
to one row of C. We have noisy measurements

    m = Cz + v    (2.15)

which need to be inverted to form estimates:

    \hat{z} = C^{-1} m.    (2.16)

Because the measurements are subject to noise, the experiment is repeated 100
times, with the results plotted as a range (the mean plus/minus one standard
deviation).

The sensitivity and variability in the estimates ẑ to the choice of measurement
model (and consequent condition number) is very clear: the larger the condition
number the greater the variability (i.e., sensitivity to noise). The Gaussian blur
function, in particular, is exceptionally sensitive and poorly conditioned.
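An experiment in the spirit of this example can be sketched as follows. Only two of the measurement structures are reproduced, a delta (identity) model and a Gaussian blur; the kernel width, noise level, and true state are hypothetical choices, not the values behind the figure.

```python
import numpy as np

def measurement_matrix(n, width):
    """n x n matrix whose rows are Gaussian-shaped measurement kernels."""
    idx = np.arange(n)
    return np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
n, noise_std, trials = 7, 0.01, 100
z_true = np.linspace(0.0, 1.0, n)          # hypothetical true state

results = {}
for name, C in (("delta", np.eye(n)), ("gauss", measurement_matrix(n, 1.0))):
    estimates = np.empty((trials, n))
    for t in range(trials):
        m = C @ z_true + noise_std * rng.standard_normal(n)   # noisy measurements (2.15)
        estimates[t] = np.linalg.solve(C, m)                   # direct inversion (2.16)
    # Store the condition number and the average standard deviation of the estimates.
    results[name] = (np.linalg.cond(C), estimates.std(axis=0).mean())
```

With the identity model the spread of the estimates equals the measurement noise; with overlapping kernels the same noise is amplified by the inversion.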
Calculated det(C)    Naive Conclusion      Possible Scenario . . .

0                    System ill-posed      Rounding error in computation of det(C).
                                           System is actually well-posed, det(C) ≠ 0.

0.00043              System well-posed     Rounding error in computation of det(C).
                                           System is actually ill-posed, det(C) = 0.


That is, it is actually very difficult to state, categorically, whether a matrix is singular, 
unless the determinant is computed analytically / algebraically. 
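The first row of the table is easy to reproduce. In the sketch below the matrix is a scaled identity, which is as well conditioned as a matrix can be, yet its determinant is too small to be represented in double precision; the size 500 and the scale 0.1 are illustrative choices.

```python
import numpy as np

C = 0.1 * np.eye(500)

det_C = np.linalg.det(C)               # exact value 10^-500 underflows in float64
sign, logdet = np.linalg.slogdet(C)    # sign and log|det| avoid the underflow
cond_C = np.linalg.cond(C)             # all singular values are equal
```

A computed determinant of zero would suggest a singular system, while the condition number shows that inverting C poses no numerical difficulty at all. This is one reason the condition number, rather than the determinant, is the practical measure.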

We can quantify the degree to which such numerical errors may occur. Suppose we
have a well-posed problem and perturb the observation to m + δm. Thus we have
the modified solution

    z + \delta z = C^{-1}(m + \delta m) \;\longrightarrow\; \delta z = C^{-1}\,\delta m    (2.18)

from which we can find bounds

    m = Cz \;\longrightarrow\; \|m\| \le \|C\| \cdot \|z\|
    \delta z = C^{-1}\,\delta m \;\longrightarrow\; \|\delta z\| \le \|C^{-1}\| \cdot \|\delta m\|.    (2.19)

We can therefore express the relative sensitivity in the estimates to perturbations in
the measurements as

    \frac{\|\delta z\|}{\|z\|} \le \kappa \, \frac{\|\delta m\|}{\|m\|}    (2.20)

where, from (2.19), we find

    \kappa = \|C\| \cdot \|C^{-1}\|    (2.21)

where κ is known as the condition number of matrix C.

Alternatively, suppose that the system matrix C itself is perturbed to C + δC, which
induces a perturbation δz in the solution to the linear system. Thus

    m = (C + \delta C)(z + \delta z) = Cz + \delta C\,z + C\,\delta z + \delta C\,\delta z.    (2.22)

Because Cz = m, and ignoring the second-order term, to first order we have

    \delta C\,z = -C\,\delta z \;\longrightarrow\; \delta z = -C^{-1}\,\delta C\,z.    (2.23)

Thus the solution sensitivity is given by

    \frac{\|\delta z\|}{\|z\|} = \frac{\|-C^{-1}\,\delta C\,z\|}{\|z\|}
        \le \frac{\|-C^{-1}\|\,\|\delta C\|\,\|z\|}{\|z\|} = \|-C^{-1}\|\,\|\delta C\|.    (2.24)

Expressing the relative sensitivity as before,

    \frac{\|\delta z\|}{\|z\|} \le \kappa \, \frac{\|\delta C\|}{\|C\|}    (2.25)

then

    \kappa = \|-C^{-1}\|\,\|C\|    (2.26)

which is the same notion of matrix sensitivity as before, in (2.21).



Clearly there can be various possible definitions of matrix condition number, since
the expression

    \kappa(C) = \|C\| \cdot \|C^{-1}\|    (2.27)

is a function of the chosen matrix norm, such as the large family of induced norms

    \|C\| = \max_{x \ne 0} \frac{\|Cx\|}{\|x\|}    (2.28)

derived from any vector norm ‖x‖. If the Euclidean vector norm is chosen, so that
(2.28) is the spectral norm [140], then (2.27) reduces to the most widely used
definition of condition number:

    \kappa(C) = \frac{\sigma_{\max}(C)}{\sigma_{\min}(C)},    (2.29)

the ratio of largest to smallest singular values (see Appendix A.7.2) of matrix C.
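Both the singular-value form of the condition number and the perturbation bound on the relative error can be checked numerically. The sketch below uses a hypothetical random 5 × 5 system and the Euclidean norm throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 5))            # hypothetical well-posed system
z = rng.standard_normal(5)
m = C @ z

# Condition number as the ratio of largest to smallest singular value.
sigma = np.linalg.svd(C, compute_uv=False)
kappa = sigma.max() / sigma.min()

# Perturb the measurements and compare the two sides of the relative-error bound.
dm = 1e-6 * rng.standard_normal(5)
dz = np.linalg.solve(C, m + dm) - z
lhs = np.linalg.norm(dz) / np.linalg.norm(z)
rhs = kappa * np.linalg.norm(dm) / np.linalg.norm(m)
```

The bound is a worst case: for a typical perturbation the left side is well below the right side, with equality approached only when δm and m align with the extreme singular directions of C.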

Intuitively, we interpret conditioning as a generalization of posedness. For our linear
system m = Cz, the problem being ill-posed implies that C is rectangular or
singular, further implying that σ_min(C) = 0, thus κ = ∞.

In other words, we can interpret ill-posed problems as lying on one end of the spectrum
of conditioning, possessing a condition number of ∞, and not materially different
from a well-posed problem having a huge condition number.



Example 2.5: Matrix Conditioning

So what does the condition number κ of a matrix really mean?

Essentially, the condition number measures the degree to which a matrix is sensitive
to the exact values of the matrix entries, entries which might be perturbed due
to floating-point representation errors, numerical rounding errors, or numerical
approximations. For example, given the two matrices

    M_1 = \begin{bmatrix} 1 & 0 & x \\ 0 & 1 & 0 \\ x & 0 & 1 \end{bmatrix}
    \qquad
    M_2 = \begin{bmatrix} 1 & 0.99 & y \\ 0.99 & 1 & 0.99 \\ y & 0.99 & 1 \end{bmatrix}    (2.30)

In M_1, any value

    -1 < x < 1    (2.31)

is valid for x, therefore M_1 has a low condition number, since none of the matrix
entries are terribly constrained by the others. On the other hand, in M_2, the value
of y is forced to be in a relatively narrow range

    0.96 < y < 1,    (2.32)

therefore the conditioning of M_2 is somewhat poor. (The reader may wish to
examine (A.55) and Figure A.1 in Appendix A.4.)
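The two ranges can be recovered numerically. The sketch below assumes that a "valid" value is one which keeps the matrix positive-definite, and that the blank entries of M_1 are zeros; both are readings of the example rather than statements taken from it.

```python
import numpy as np

def M1(x):
    return np.array([[1.0, 0.0, x], [0.0, 1.0, 0.0], [x, 0.0, 1.0]])

def M2(y):
    return np.array([[1.0, 0.99, y], [0.99, 1.0, 0.99], [y, 0.99, 1.0]])

def is_positive_definite(M):
    """A symmetric matrix is positive-definite when its smallest eigenvalue is positive."""
    return bool(np.linalg.eigvalsh(M).min() > 0.0)

def valid_range(make, grid):
    """Smallest and largest grid values for which make(value) is positive-definite."""
    ok = [v for v in grid if is_positive_definite(make(v))]
    return min(ok), max(ok)

grid = np.linspace(-1.5, 1.5, 3001)     # candidate values in steps of 0.001
x_range = valid_range(M1, grid)
y_range = valid_range(M2, grid)
```

The search returns an interval of width about two for x but only about 0.04 for y, which is the sense in which the entries of M_2 constrain one another.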

One can study condition number more systematically by simulation. Suppose
we have a one-dimensional random process z with Gaussian correlation,
parametrized by correlation length ξ:

    z(\xi) \sim P(\xi), \qquad 0 < \xi    (2.33)

Suppose we randomly distort the elements of P with a symmetric perturbation:

    \tilde{P}(\xi) = P(\xi) + (W + W^T), \qquad W_{ij} \sim N(0, \sigma^2).    (2.34)

By doing this operation 500 times, we can examine how often P stays positive-
definite:

[Figure: results of the perturbation experiment; axis label: Condition Number of Matrix P.]



We see very clearly that larger condition numbers permit only smaller and smaller 
perturbations, if the matrix P is to stay positive-definite. That is, increasing the 
condition number pushes the covariance eigenvalues ever closer to zero, making 
it possible for ever smaller perturbations to push the eigenvalues of P across zero 
to be negative, in which case P fails to be positive-definite. 
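A small version of this experiment is sketched below. The process length, the particular form of the Gaussian correlation, the perturbation size σ, and the set of correlation lengths are all hypothetical choices; the figure in the text was presumably generated with different settings.

```python
import numpy as np

def gaussian_covariance(n, xi):
    """Covariance with Gaussian correlation: P_ij = exp(-((i - j) / xi)^2) (assumed form)."""
    idx = np.arange(n)
    return np.exp(-((idx[:, None] - idx[None, :]) / xi) ** 2)

def fraction_positive_definite(P, sigma, trials, rng):
    """Fraction of symmetric perturbations P + (W + W^T) that stay positive-definite."""
    count = 0
    for _ in range(trials):
        W = sigma * rng.standard_normal(P.shape)
        P_tilde = P + (W + W.T)
        count += int(np.linalg.eigvalsh(P_tilde).min() > 0.0)
    return count / trials

rng = np.random.default_rng(0)
summary = {}
for xi in (0.5, 1.0, 2.0, 4.0):
    P = gaussian_covariance(10, xi)
    summary[xi] = (np.linalg.cond(P), fraction_positive_definite(P, 1e-3, 500, rng))
```

Longer correlation lengths give larger condition numbers, and the fraction of perturbed matrices that remain positive-definite falls accordingly.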
[Figure panels include: Original Image; Nonlinear Median Filter.]

Fig. 2.5. The image denoising problem, following up on Figure 2.3. By making very modest 
assumptions about image smoothness, it is possible to regularize the inverse problem. A simple 
linear filter assumes the image to be smooth, such that the estimates are slightly blurred to 
attenuate the noise; a median filter, on the other hand, is a nonlinear estimator assuming the 
image to be piecewise planar, such that isolated noise pixels are more completely removed, 
and the image edges remain sharp. Further discussion can be found in Appendix C around 
Figure C.5. 



2.4 Regularization and Prior Models 



The questions of posedness and, to some extent, of conditioning are mostly theo- 
retical ones: in our context, nearly all practical inverse problems are ill-posed. The 
question, then, of greatest practical interest is how to find some sort of meaningful 
solution to a given ill-posed inverse problem. 

For example, although the image subsampling and additive noise examples of Fig- 
ure 2.3 are ill-posed, both problems are routinely solved using interpolation, as is 
illustrated for the denoising problem in Figure 2.5. However the act of interpolation
presupposes some degree of continuity or smoothness in z, assumptions which
were not explicitly stated in the model m = Cz; these additional assumptions or 
constraints are the key to a well-posed problem. 

The general approach is to take a given inverse problem and to improve the posedness
or conditioning by applying additional constraints. In principle, asserting constraints
to make a problem well-posed is very easy; however the key to meaningful
regularization is to assert just enough constraints to adequately condition without
excessively modifying the original inverse problem. The approach taken depends on
the nature of the ill-posedness or ill-conditioning.

Ill-Posedness due to Existence Failure

    Basic Circumstance     More measurements than unknowns
    Common Example         Linear Regression
    Needed Constraints?    Need to make assertion regarding the Measurement model;
                           No knowledge needed about z.

Ill-Posedness due to Uniqueness Failure

    Basic Circumstance     More unknowns than measurements
    Common Example         Image Interpolation
    Needed Constraints?    Need to make assertion regarding the Prior model;
                           Knowledge is required about z.

Table 2.3. A Comparison of Ill-Posedness due to Uniqueness and Existence Failures

First, if existence fails but uniqueness is satisfied, we typically select ẑ to be most
consistent with the measurements; that is, we estimate

    \hat{z} = \arg\min_z \| m - f(z) \|.    (2.35)

Since uniqueness is satisfied, the most consistent z is expected⁴ to be unique; therefore
no prior information or additional constraints are required regarding z. The problem
is made solvable by asserting that the measurement model is not exact, that the
measurements are subject to error. If we choose a typical squared norm ‖·‖ = |·|²,
then (2.35) is referred to as a least-squares problem.

If uniqueness fails, there are multiple choices of z which perfectly satisfy the for- 
ward problem. Since there are multiple choices for z satisfying the measurement 
model perfectly, we require constraints or prior information on z itself. A very simple
constraint is to select the smallest z:

    \hat{z} = \arg\min_z \{ \|z\| \mid m = f(z) \}.    (2.36)

In practice this approach is not very useful (see Example 2.6), so Sections 2.4.1 
and 5.5 develop other constraints. At a minimum, together the constraints and mea- 
surements must satisfy uniqueness. 

It is important to clearly recognize the fundamental differences between problems of 
existence and problems of uniqueness, as outlined in Table 2.3. The natures of the 
problems and corresponding solutions are quite different. 



4 Whether there are one or more most consistent z will depend on the details of f() and the
choice of norm. The most consistent z will be unique for linear f and quadratic norm.
Questions of continuity or conditioning are a bit more subtle, addressed by
regularization. The idea behind regularization stems from a proposal of Tikhonov
[308-310], which was to design a family of estimators ẑ(m, λ) which continuously
approximate some ideal (ill-conditioned or discontinuous) estimator z(m), such that

    \lim_{\lambda \to 0} \hat{z}(m, \lambda) = z(m).    (2.37)

In the context of this book, our specific interest is in linear least-square inverse
problems, in which case the regularized estimator, combining (2.35) and (2.36), becomes

    \hat{z}(m, \lambda) = \arg\min_z \{ (Meas. Constraints) + \lambda (Estimate Constraints) \}    (2.38)
                        = \arg\min_z \{ \|m - Cz\|_{R^{-1}} + \lambda \|Lz\| \},    (2.39)

where each row of L asserts a constraint, separate from the measurements, and where

    \|m - Cz\|_{R^{-1}} = (m - Cz)^T R^{-1} (m - Cz)    (2.40)

    \|Lz\| = \|Lz\|_{I} \triangleq (Lz)^T (Lz) = z^T (L^T L) z,    (2.41)

where R represents the measurement uncertainty.

The idea is that λ controls the degree of approximation:

Tiny λ: Weak constraints, faithful to the original, but poorly conditioned;
Large λ: Strong constraints, a poor approximation, but well conditioned.

This leaves us with two questions: how do we find appropriate constraints L, and
how is an appropriate value of λ to be selected?
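Because both terms of the regularized criterion are quadratic in z, setting the gradient to zero gives the linear system (C^T R^{-1} C + λ L^T L) ẑ = C^T R^{-1} m. The sketch below solves that system and records how the conditioning of the system matrix changes with λ. The blur matrix, the choice L = I, and the values of λ are hypothetical, and the closed form is a standard consequence of the quadratic criterion rather than a formula quoted from this section.

```python
import numpy as np

def regularized_estimate(C, m, L, lam, R=None):
    """Minimize (m - Cz)^T R^-1 (m - Cz) + lam * z^T L^T L z over z."""
    R_inv = np.eye(C.shape[0]) if R is None else np.linalg.inv(R)
    A = C.T @ R_inv @ C + lam * (L.T @ L)           # matrix of the normal equations
    z_hat = np.linalg.solve(A, C.T @ R_inv @ m)
    return z_hat, np.linalg.cond(A)

# Hypothetical ill-conditioned forward model: a wide Gaussian blur of 7 unknowns.
idx = np.arange(7)
C = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
m = C @ np.linspace(0.0, 1.0, 7)
L = np.eye(7)                                       # simplest constraint: penalize ||z||

conds = {lam: regularized_estimate(C, m, L, lam)[1] for lam in (1e-8, 1e-4, 1e-1)}
```

The condition number of the system to be solved drops steadily as λ grows, at the price of an estimate that is pulled further from the unregularized solution.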

How we interpret and answer these questions depends on our basic understanding of 
the problem, leading to one of the most significant philosophical divides in statistics: 

Deterministic: The vector z is just a set of unknowns, a set of numbers which we 
need to estimate. 

Bayesian: The vector z is a set of random variables, which obey certain statistics
given by a prior model. 

In many ways this division is nothing more than philosophical, in that mathematically 
the two approaches can result in the same algebraic formulation, as discussed in 
Section 3.2.4. However, the choice of approach may influence how we set up and 
understand an estimation problem, outlined in the following two sections.

Example 2.6: Interpolation and Regularization

Given the inverse problem from Example 2.2,

[Figure: five unknowns $z_1, \ldots, z_5$, with measurements $m_1$ of $z_1$ and $m_2$ of $z_5$.]

with forward model

$$m = Cz \quad \text{or} \quad \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} z_1 \\ \vdots \\ z_5 \end{bmatrix} \tag{2.42}$$

there is a wide variety of approaches which we could now try to follow, in order
to generate a solution:

(i) We could try to solve for z as

$$\hat{z} = \arg_z \min \|m - Cz\|$$

Problem: Uniqueness fails (as we saw in Example 2.2).

(ii) We could modify the criterion based on (2.36) and regularize:

$$\hat{z} = \arg_z \min \{ \|m - Cz\| + \lambda \|z\| \} \quad \Rightarrow \quad \text{Well posed!}$$

The solution for z then looks like

[Figure: the regularized estimate; the unmeasured points are estimated as zero.]

What went wrong? The regularization constraint penalized the deviation of
individual elements of z from zero, therefore the unmeasured points were
just set to zero.

We need constraints which inter-relate the elements of z.






(iii) To have constraints which inter-relate state elements, we can penalize the
difference between adjacent elements, essentially a penalty on slope:

$$\hat{z} = \arg_z \min \{ \|m - Cz\| + \lambda \|Lz\| \} \tag{2.43}$$

where

$$L = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{bmatrix}.$$

If $\lambda = 1$, then the solution (2.43) is found to be

$$\hat{z} = \begin{bmatrix} 5/6 & 1/6 \\ 4/6 & 2/6 \\ 3/6 & 3/6 \\ 2/6 & 4/6 \\ 1/6 & 5/6 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}$$

[Figure: the five estimates lie on a straight line running between the measurements $m_1$ and $m_2$.]

Now we're interpolating!

Observe, however, that $\hat{z}_1 \neq m_1$, $\hat{z}_5 \neq m_2$ because of the compromise in
(2.43) between satisfying the measurement constraint

$$(z_1 - m_1)^2 + (z_5 - m_2)^2$$

and the slope constraint

$$\sum_i (z_i - z_{i+1})^2.$$

We can observe the effects of this tradeoff by varying $\lambda$:

[Figure: three panels. Small $\lambda$: Meas. Dominate. Medium $\lambda$: Balanced. Large $\lambda$: Slope Dominates.]

At this point it is not, however, possible to talk about the correct or optimum value
of $\lambda$. In principle, all possible tradeoffs between measurement and prior yield a
valid estimate. Inferring the optimum $\lambda$ on the basis of the measurements is the
subject of validation, discussed in Example 2.8.



The power of this approach, in comparison to standard interpolation, is in its flexibility:

(iv) Suppose we believe that z is smooth, but should decay to zero away from
measurements:

$$\hat{z} = \arg_z \min \{ \|m - Cz\| + \lambda \alpha \|Lz\| + \lambda (1 - \alpha) \|z\| \}. \tag{2.44}$$

This is no longer linear interpolation:

[Figure: two example estimates from (2.44).]

In both cases we see intermediate points pulled towards zero.

(v) We could, in principle, have a space-varying slope penalty:

$$L = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 2 & -2 & 0 & 0 \\ 0 & 0 & 3 & -3 & 0 \\ 0 & 0 & 0 & 4 & -4 \end{bmatrix}$$

which would lead to the following estimates:

[Figure: estimates under the space-varying penalty.]

having a high penalty (low slope) to the right, and a comparatively low
penalty (permitting high slope) to the left.
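
The closed-form result of part (iii) can be checked numerically. The following is a minimal sketch, assuming that the norms in (2.43) are the quadratic forms of (2.40) and (2.41) with R = I, so that the minimizer satisfies the normal equations $(C^T C + \lambda L^T L)\hat{z} = C^T m$. The function name `regularized_estimate` and the use of NumPy are illustrative choices, not part of the text.

```python
import numpy as np

def regularized_estimate(C, L, m, lam):
    """Minimize ||m - C z||^2 + lam * ||L z||^2 by solving the normal equations."""
    A = C.T @ C + lam * (L.T @ L)
    return np.linalg.solve(A, C.T @ m)

n = 5
# Measurements observe only the first and last of the five unknowns.
C = np.zeros((2, n))
C[0, 0] = 1.0
C[1, n - 1] = 1.0

# First-difference operator: row i penalizes z_i - z_{i+1}.
L = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)

# Feeding the columns of the 2x2 identity as "measurements" returns the
# 5x2 matrix that maps (m1, m2) to the estimate.
gain = regularized_estimate(C, L, np.eye(2), lam=1.0)
print(np.round(6 * gain, 6))   # rows (5,1), (4,2), (3,3), (2,4), (1,5)

# Part (ii): penalizing ||z||^2 instead leaves the unmeasured points at zero.
gain_zero = regularized_estimate(C, np.eye(n), np.eye(2), lam=1.0)
print(gain_zero)
```

Varying `lam` in the call above reproduces the tradeoff between the measurement and slope terms, and replacing `L` by the space-varying matrix of part (v) gives the corresponding non-uniform interpolation.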



2.4.1 Deterministic Regularization 



The vector z is just a set of unknowns, a set of numbers which we need to estimate.
If the unknowns are small in number and existence is satisfied, then the usual formulation
of (2.35) may be adequate. However, when uniqueness fails and for the large
spatial problems of interest to us, some additional regularization / conditioning will
be required by specifying constraints L, leading to

$$\hat{z} = \arg_z \min \{ \|m - Cz\|_{R^{-1}} + \lambda \|Lz\| \}, \tag{2.45}$$

where uniqueness is satisfied if and only if the combined measurements and regularization
constraints

$$\begin{bmatrix} C \\ L \end{bmatrix} \tag{2.46}$$



have full column rank. By far the most common regularizer is a smoothness constraint.
If we imagine, for notational convenience, that z(x) is an unknown signal, a
function of continuous spatial index x, then we can interpret a smoothness constraint
as penalizing the derivatives of z(x), integrated over space:

$$\|Lz\| = \int \sum_i \gamma_i \left( z^{(i)}(x) \right)^2 dx, \tag{2.47}$$



where $z^{(i)}$ is the ith order derivative. Considering a two-dimensional image z(x,y),
the most common definitions of smoothness are the first-order membrane constraint

$$\|L_1 z\| = \iint \left( \frac{\partial z}{\partial x} \right)^2 + \left( \frac{\partial z}{\partial y} \right)^2 dx\, dy, \tag{2.48}$$

which corresponds to the energy of a thin membrane or rubber sheet, and the second-order
thin-plate constraint

$$\|L_2 z\| = \iint \left( \frac{\partial^2 z}{\partial x^2} + \frac{\partial^2 z}{\partial y^2} \right)^2 dx\, dy \tag{2.49}$$

which corresponds to the energy of a thin steel plate. Obviously the above easily
generalize to other orders and to other numbers of dimensions, as we show in Section 5.5.
In practice, of course, for a vector z the above derivatives are approximated
as differences.
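
The uniqueness condition (2.46) and the replacement of derivatives by differences can be illustrated together. The sketch below is an assumption-laden illustration rather than the text's own construction: the helper names `first_difference`, `second_difference`, and `is_unique` are hypothetical, the problem is one-dimensional, and rank is evaluated numerically with NumPy.

```python
import numpy as np

def first_difference(n):
    """(n-1) x n operator approximating the first derivative of a length-n signal."""
    return np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

def second_difference(n):
    """(n-2) x n operator approximating the second derivative (rows 1, -2, 1)."""
    return first_difference(n - 1) @ first_difference(n)

def is_unique(C, L):
    """Uniqueness test of (2.46): the stacked matrix [C; L] has full column rank."""
    stacked = np.vstack([C, L])
    return bool(np.linalg.matrix_rank(stacked) == stacked.shape[1])

n = 6
one_meas = np.eye(n)[[0]]          # a single measurement, of z_1
two_meas = np.eye(n)[[0, n - 1]]   # measurements of z_1 and z_n

print(is_unique(one_meas, first_difference(n)))    # True: one point fixes the constant
print(is_unique(one_meas, second_difference(n)))   # False: a line needs two points
print(is_unique(two_meas, second_difference(n)))   # True
```

The first-order constraint leaves only the mean unconstrained, so one measurement suffices; the second-order constraint leaves both mean and slope free, so at least two measurements are needed.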

With a smoothness constraint chosen, the remaining objective is the selection of
parameter $\lambda$. The key problem is that, by definition, there is no optimum or correct
approach for selecting $\lambda$, because there is no objective criterion for it. Instead, we
are required to select some heuristic, which has led to a large literature on parameter
selection, validation, cross-validation, etc. [139, 140, 172, 173, 322, 323].

Since the focus of this text is on Bayesian regularization, the choice of $\lambda$ does not
greatly concern us, so we limit our attention to a few basic approaches [27].

Validation: We select $\lambda$ to relax the estimation problem to a certain degree of
"badness." There are two basic alternatives:

1. In (2.39), of all of the solutions z which satisfy the constraints to some degree,
select the estimate which best satisfies the measurements; that is,

$$\hat{z} = \arg_z \min \{ \|m - Cz\|_{R^{-1}} \} \quad \text{such that} \quad \|Lz\| < \epsilon \tag{2.50}$$

where $\epsilon$ is a specified relaxation parameter. It can be shown [172] that this is
equivalent to selecting $\lambda$ such that

$$\|L \hat{z}(m, \lambda)\| = \epsilon. \tag{2.51}$$

2. In (2.39), of all of the solutions z which satisfy the measurements to some
degree, select the estimate which best satisfies the constraints; that is,

$$\hat{z} = \arg_z \min \{ \|Lz\| \} \quad \text{such that} \quad \|m - Cz\|_{R^{-1}} < \epsilon \tag{2.52}$$

where $\epsilon$ is a specified relaxation parameter. It can be shown [173] that this is
equivalent to selecting $\lambda$ such that

$$\|m - C \hat{z}(m, \lambda)\|_{R^{-1}} = \epsilon. \tag{2.53}$$

Example 2.7: Interpolation and Smoothness Models

We continue with Example 2.6: we have a one-dimensional problem, with a few
measurements, which we would like to interpolate.

For illustration purposes, we consider two smoothness models:

First-order example: Penalize $\partial z / \partial x$
Second-order example: Penalize $\partial^2 z / \partial x^2$

[Figure: interpolated estimates for the first-order and second-order examples.]

In each case there are four measurements, indicated by dots.

The first-order case penalizes slope (but not mean) and the solution is clearly
piecewise-linear. As $\lambda$ increases it tends towards a horizontal (zero-slope) line,
equal to the mean of the data points.

The second-order case penalizes curvature (but not slope or mean) and is piecewise
parabolic. As $\lambda$ increases the solution tends towards a straight line, corresponding
to the least-squares linear regression through the data points.
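
The limiting behaviour described in Example 2.7 can be checked numerically. This is a sketch under stated assumptions: R = I, quadratic norms as in (2.40) and (2.41), and four made-up measurement locations and values (the variables `idx` and `m` below are hypothetical, not the data plotted in the example).

```python
import numpy as np

def regularized_estimate(C, L, m, lam):
    """Minimize ||m - C z||^2 + lam * ||L z||^2 by solving the normal equations."""
    A = C.T @ C + lam * (L.T @ L)
    return np.linalg.solve(A, C.T @ m)

n = 11
x = np.arange(n, dtype=float)
idx = np.array([1, 4, 6, 9])           # hypothetical measurement locations
m = np.array([1.0, 3.0, 2.0, 5.0])     # hypothetical measured values
C = np.eye(n)[idx]

D1 = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # first difference (slope)
D2 = D1[:-1, :-1] @ D1                          # second difference (curvature)

lam = 1e7                                       # a "large" regularization parameter
z_first = regularized_estimate(C, D1, m, lam)
z_second = regularized_estimate(C, D2, m, lam)

# Reference curves: the data mean, and the least-squares regression line.
slope, intercept = np.polyfit(x[idx], m, 1)
print(np.max(np.abs(z_first - m.mean())))
print(np.max(np.abs(z_second - (slope * x + intercept))))
```

Both printed deviations shrink as `lam` grows; reducing `lam` towards zero instead recovers the piecewise-linear and piecewise-parabolic interpolants.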
Cross-Validation: In both of the above cases we still require the selection of
parameter $\epsilon$, which is a limited improvement over having to specify $\lambda$. In response,
a variety of considerably more powerful techniques, known as cross-validation
or generalized cross-validation, has been developed [322, 323], which allows the
measurements to specify the most natural or fitting value of $\lambda$.

For example, suppose that we have n measurements $m_0, \ldots, m_{n-1}$, such that

$$m_i = C_i z + v_i. \tag{2.54}$$

Then define the partial estimator $\hat{z}_{\bar{i}}(\lambda)$ as our usual estimator (2.39), parametrized
by $\lambda$, but omitting data point $m_i$:

$$\hat{z}_{\bar{i}}(\lambda) = \arg_z \min \Big\{ \sum_{j \neq i} \|m_j - C_j z\| + \lambda \|Lz\| \Big\}. \tag{2.55}$$

Then, because $\hat{z}_{\bar{i}}(\lambda)$ did not use $m_i$, we can meaningfully ask how close $m_i$ is to
$C_i \hat{z}_{\bar{i}}(\lambda)$. Ideally, if the estimator is generalizing or extrapolating sensibly from the
given measurements (that is, as opposed to being corrupted by numerical errors
for $\lambda$ too small, or being distorted by excessive constraints for $\lambda$ too large), this
difference should be small, thus we can estimate $\lambda$ as

$$\hat{\lambda} = \arg_\lambda \min \Big\{ \sum_i \|m_i - C_i \hat{z}_{\bar{i}}(\lambda)\| \Big\}. \tag{2.56}$$

This particular form of cross-validation used a leave-one-out or "jackknife" approach,
which is one of the simplest approaches from the large field of bootstrapping,
bagging, and resampling theory [91, 93].
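
The leave-one-out criterion of (2.55) and (2.56) can be sketched in a few lines. The code below assumes scalar measurements with unit noise weighting, quadratic norms, and a search over a coarse logarithmic grid of candidate values; the test signal, noise level, and grid are made up for illustration, and the brute-force refitting is chosen for clarity rather than efficiency.

```python
import numpy as np

def regularized_estimate(C, L, m, lam):
    """Minimize ||m - C z||^2 + lam * ||L z||^2 by solving the normal equations."""
    A = C.T @ C + lam * (L.T @ L)
    return np.linalg.solve(A, C.T @ m)

def loo_cv_score(C, L, m, lam):
    """Sum over i of the squared error in predicting m_i from all other measurements."""
    score = 0.0
    for i in range(len(m)):
        keep = np.arange(len(m)) != i
        z_partial = regularized_estimate(C[keep], L, m[keep], lam)  # (2.55)
        score += float((m[i] - C[i] @ z_partial) ** 2)
    return score

rng = np.random.default_rng(0)
n = 30
x = np.arange(n, dtype=float)
truth = 0.2 * x                        # hypothetical straight-line signal
idx = np.arange(0, n, 3)               # every third element is measured
C = np.eye(n)[idx]
m = truth[idx] + 0.5 * rng.standard_normal(len(idx))

D1 = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # first-order (slope) constraint

lams = 10.0 ** np.arange(-2, 5)                 # candidate values of lambda
scores = [loo_cv_score(C, D1, m, lam) for lam in lams]
lam_hat = lams[int(np.argmin(scores))]          # grid version of (2.56)
print(lam_hat, scores)
```

Swapping `D1` for a second-difference operator gives the second-order comparison; for large problems the repeated solves would normally be replaced by a generalized cross-validation formula.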



Example 2.8: Interpolation and Cross Validation

Referring back to Example 2.7, we can clearly see that the nature of the estimates
is a function of the chosen penalty and associated regularization parameter. Here
we use the method of cross-validation, from (2.56), to infer the best value of $\lambda$.
Suppose we have a straight-line function, of which we have noisy measurements:

[Figure: a straight-line function and its noisy measurements.]

We could imagine specifying either a first-order or second-order constraint to regularize
the problem:

A first-order constraint penalizes slope
$\Rightarrow$ Our function has slope: a misfit to the constraint

A second-order constraint penalizes curvature
$\Rightarrow$ Our function has no curvature: perfectly consistent

Since the true function has non-zero slope, there is a tradeoff between believing
the measurements and the first-order constraint; whereas the absence of curvature
in the function means that there is no conflict or tradeoff between the measurements
and the second-order case. We can plot (2.56) as a function of $\lambda$:

[Figure: the cross-validation criterion (2.56) plotted against the regularization constant $\lambda$, for the first-order constraint and for the second-order constraint.]

In the first-order case we can clearly see a unique optimum $\lambda \approx 100$, representing
the best tradeoff between measurements and constraints. In contrast, the
second-order case seeks an arbitrarily large $\lambda$: the more strongly the constraints
are asserted, the better are the results. Below are plotted the estimates for three
values of $\lambda$:

[Figure: estimates for the first-order case ($\lambda = 10^0, 10^2, 10^4$) and for the second-order case ($\lambda = 10^2, 10^4, 10^6$).]

In the first-order case, the transition from measurement overfit (thin line, $\lambda$ too
small), to well fit, to overconstrained (thick line, $\lambda$ too large) is clearly visible. In
the second-order case, the larger values of $\lambda$ progressively lead to a better fit.

2.4.2 Bayesian Regularization

In the Bayesian setting, the vector z is a set of random variables. That is, the values
in z are not just unknown, rather they obey certain statistics, specified by a prior
model.

Although a wide range of prior models is possible, in the context of this book we
are most interested in the second-order characterization of the unknowns: that is,
specifying the prior mean $\mu$ and covariance⁵ P,

$$z \sim (\mu, P), \tag{2.57}$$

where, to be a valid covariance, P must be symmetric and positive-definite. If we
compute the matrix square root $P = \Gamma^T \Gamma$ (see Appendix A.8), then

$$\Gamma^{-T} (z - \mu) = w, \tag{2.58}$$

where w is a unit-variance, white-noise vector. That is, the covariance P implicitly
specifies a set of independent constraints $\Gamma^{-T}$ on z; similarly, a set of constraints L
implicitly specifies a sort of prior model⁶ $P = (L^T L)^{-1}$.

5 See Appendix B.4 for a brief review of covariance matrices.

In other words, there is a certain duality, or equivalence, between specifying a statistical
prior covariance P for z and a set of constraints $L = \Gamma^{-T}$ on z. The differences,
then, between a Bayesian prior model and a set of constraints are as follows:

1. In the Bayesian case, the statistics are an inherent part of the problem, they are
not just a constraint to condition the numerics. Of course, the degree to which we
believe the statistics may live on a continuum:

• High belief: the statistics are known to be correct, and can be proved from
physics or derived mathematically;

• Medium belief: the problem obeys a prior, however the prior is complicated
or unknown, so we approximate it, possibly learned from measurements;

• Low belief: the problem doesn't really obey a prior; the statistics are asserted
essentially for regularization purposes.

2. In the Bayesian case, there is a unique solution to the constraint parameter $\lambda$. In
order to minimize the mean-squared error

$$E\left[ (z - \hat{z})(z - \hat{z})^T \right] \tag{2.59}$$

we will see (Chapter 3) that $\lambda = 1$. That is, the tradeoff between measurements
and prior is not under user control, but is inherent in the relative uncertainties in
the measurements and in the prior model.

6 In many cases $(L^T L)$ will be singular, however this does not necessarily prevent us from
computing estimates.
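
The duality between a prior covariance and a set of constraints can be illustrated numerically. The sketch below makes several assumptions that are not taken from the text: a made-up exponential prior covariance, a zero prior mean, and the standard linear least-squares formula $P C^T (C P C^T + R)^{-1} m$ as the Bayesian reference estimate (the text defers its derivation to Chapter 3). NumPy's Cholesky factor G satisfies $P = G G^T$, so G plays the role of $\Gamma^T$ and $L = G^{-1}$ plays the role of $\Gamma^{-T}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
x = np.arange(n)

# Hypothetical prior covariance (exponential correlation) and measurement model.
P = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)
C = np.eye(n)[[0, 2, 5]]
R = 0.1 * np.eye(C.shape[0])
m = rng.standard_normal(C.shape[0])

# Constraints implied by the prior: P = G G^T, so L = G^{-1} gives L^T L = P^{-1}.
G = np.linalg.cholesky(P)
L = np.linalg.inv(G)

# Regularized estimate (2.39) with lambda = 1.
R_inv = np.linalg.inv(R)
z_regularized = np.linalg.solve(C.T @ R_inv @ C + L.T @ L, C.T @ R_inv @ m)

# Reference: standard linear least-squares estimate for a zero-mean prior.
z_bayes = P @ C.T @ np.linalg.solve(C @ P @ C.T + R, m)

print(np.max(np.abs(z_regularized - z_bayes)))
```

With any other value of the regularization parameter the two estimates no longer coincide, which is the sense in which the tradeoff is fixed by the relative uncertainties P and R.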



2.5 Statistical Operations 



The discussion thus far has been somewhat abstract, focusing on establishing an 
understanding of inverse problems and related issues of posedness, conditioning, 
regularization, and prior models. 

The remainder of this chapter seeks to make the discussion considerably more spe- 
cific, focusing on the three canonical problems which are of greatest interest to us, 
and a discussion of the types of statistical questions which we may wish to answer. 



2.5.1 Canonical Problems 

We are interested in large static and time-varying linear stochastic systems. The
reader should observe how these problems are interconnected: the static problem is
a special case of data fusion, and the static and data fusion problems are both special
cases of the dynamic one.

I. The Static Problem: Given a single random field z obeying some prior model,

$$z \sim \mathcal{N}(0, P), \tag{2.60}$$

we are given a single set of measurements

$$m = Cz + v \qquad v \sim \mathcal{N}(0, R). \tag{2.61}$$

Examples of this category of problem include image enhancement, image restoration,
and many applications of medical imaging.

The reader may wonder whether the above formulation is sufficiently general,
since z and v may not be guaranteed to be zero mean. In fact, because our problem
is linear, superposition applies, in which case the deterministic (mean) and random
parts of the problem are fully decoupled. In other words, given

$$z \sim \mathcal{N}(\mu, P), \tag{2.62}$$

we can formulate the mean-removed field

$$\tilde{z} = z - \mu \sim \mathcal{N}(0, P). \tag{2.63}$$

After computing the estimates $\hat{\tilde{z}}$ for the mean-removed field, it follows that

$$\hat{z} = \hat{\tilde{z}} + \mu. \tag{2.64}$$

Similarly, because any deterministic mean portion of the measurements is known
ahead of time, the mean does not affect the estimates. That is, given

$$m = Cz + a + v \qquad v \sim \mathcal{N}(b, R), \tag{2.65}$$

for constants a, b it follows that

$$\hat{z}(m) = \hat{z}(m - a - b). \tag{2.66}$$

Thus, without loss of generality, we normally assume all mean terms to equal
zero. From time to time, if it assists in clarity or intuition, the mean term may be
included.

II. The Data Fusion Problem: We are given one or more random fields $z_i$ obeying
some prior model,

$$z = \begin{bmatrix} z_1 \\ \vdots \end{bmatrix} \sim \mathcal{N}(0, P), \tag{2.67}$$

where the prior model may couple the fields (dense P) or leave them uncoupled
(block-diagonal P).

We are given two or more sets of measurements

$$m_i = C_i z + v_i \qquad v_i \sim \mathcal{N}(0, R_i). \tag{2.68}$$

The relationship $C_i$ between measurements and unknowns may be highly problem-dependent.

Such data fusion problems are particularly common in remote sensing, where measurements
may be available from multiple instruments on a single satellite platform,
or where data from multiple satellites are to be used in studying a particular
area.

III. The Dynamic Problem: We are given an initial prior model

$$z(0) \sim \mathcal{N}(0, P_0) \tag{2.69}$$

and a time-recursive dynamic model

$$z(t + 1) = A(t) z(t) + B(t) w(t) \qquad w(t) \sim \mathcal{N}(0, I) \tag{2.70}$$

where the process noise or driving term w(t) is white and uncorrelated with z:

$$E[w(t) w(s)^T] = \delta_{s,t}\, I \qquad E[w(t) z(s)^T] = 0 \ \text{ if } t \ge s. \tag{2.71}$$

Measurements of the process arrive over time:

$$m(t) = C(t) z(t) + v(t) \qquad v(t) \sim \mathcal{N}(0, R(t)). \tag{2.72}$$

The most obvious examples of multidimensional dynamic problems are in video
data processing.



2.5.2 Prior Sampling 

Given a random variable z obeying some prior probability density function (PDF)
p(z), sampling from the prior distribution means generating independent random
samples $z_1, \ldots, z_q$ from p(z). For a set $\{z_i\}$ to represent independent random samples
of a distribution implies that any statistical question posed of the random variable
z and of the samples $\{z_i\}$ converges to the same value as the size of the sample set
grows. In other words

$$\lim_{q \to \infty} \frac{1}{q} \sum_i f(z_i) \longrightarrow E[f(z)] \tag{2.73}$$

for any function f.

In the context of the canonical static problem (2.60), given a random vector z obeying
a prior model

$$z \sim (\mu, P), \tag{2.74}$$

we may be interested in generating independent random samples $z_1, \ldots, z_q$ such that

$$\frac{1}{q} \sum_i z_i \longrightarrow \mu \qquad \frac{1}{q} \sum_i (z_i - \mu)(z_i - \mu)^T \longrightarrow P. \tag{2.75}$$

Such samples may be generated by finding $\Gamma$, the matrix square root (see Appendix A.8)
of P:

$$\text{Given } w \sim (0, I): \qquad \operatorname{cov}(\Gamma^T w) = E\left[ \Gamma^T w w^T \Gamma \right] = \Gamma^T I\, \Gamma = P. \tag{2.76}$$

Thus $\mu + \Gamma^T w$ is a random prior sample from model $(\mu, P)$. Three samples from a
single random field model are shown in Figure 2.6.

Fig. 2.6. Three samples from the prior model corresponding to a "Tree-Bark" random field.
Being random samples, all three images are different, yet obey the same statistics.

In the context of the canonical dynamic problem (2.70), the process of prior sampling
is normally known as simulation, in that we are simulating the evolution of the
dynamic process over time. The process initialization

$$z(0) \sim \mathcal{N}(0, P_0) \tag{2.77}$$

requires sampling z(0) from the prior model $P_0$, as discussed above. With this step
completed, the remainder of the simulation is the straightforward recursion

$$z(t + 1) = A(t) z(t) + B(t) w(t). \tag{2.78}$$

Because w(t) obeys a simple prior $\mathcal{N}(0, I)$, generating a random sample of w(t)
just amounts to generating a vector of zero-mean, unit-variance Gaussian random
variables.
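
Both forms of prior sampling can be sketched compactly. The code below is illustrative only: it assumes Gaussian samples, uses the Cholesky factor as one possible matrix square root (NumPy returns G with $P = G G^T$, so G stands in for $\Gamma^T$), and takes A and B to be constant over time. The covariance, dynamics, and function names are made up for the example.

```python
import numpy as np

def sample_prior(mu, P, rng):
    """Draw one sample from (mu, P) as mu + G w, with P = G G^T and w white."""
    G = np.linalg.cholesky(P)
    w = rng.standard_normal(len(mu))
    return mu + G @ w

def simulate(A, B, P0, steps, rng):
    """Simulate z(t+1) = A z(t) + B w(t), starting from z(0) ~ N(0, P0)."""
    z = sample_prior(np.zeros(P0.shape[0]), P0, rng)
    path = [z]
    for _ in range(steps):
        w = rng.standard_normal(B.shape[1])   # unit-variance white driving noise
        z = A @ z + B @ w
        path.append(z)
    return np.array(path)

rng = np.random.default_rng(0)
n = 4
x = np.arange(n)
mu = np.array([1.0, 0.0, -1.0, 2.0])                   # hypothetical prior mean
P = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)     # hypothetical prior covariance

# Static problem: the sample statistics should approach (mu, P), as in (2.75).
samples = np.array([sample_prior(mu, P, rng) for _ in range(5000)])
print(samples.mean(axis=0))
print(np.cov(samples.T))

# Dynamic problem: ten steps of a simple, stable, time-invariant model.
path = simulate(0.9 * np.eye(n), 0.3 * np.eye(n), P, steps=10, rng=rng)
print(path.shape)
```

For the large random fields of interest in this text a dense Cholesky factorization is rarely practical, so this direct approach should be read as a small-scale illustration of (2.76) and (2.78) only.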

The generation of samples from a prior model is of interest for two main reasons: 



1. The samples z or z(t) depend purely on the prior model, with no dependence 
on any measurements. Thus the prior samples reflect the assumptions implicit 
in the prior model, possibly leading to insights regarding model strengths and 
weaknesses, based on whether the observed sample behaviour is as expected or 
runs counter to the desired physics or mathematics. 

2. The samples z or z(t) may be of interest in and of themselves. Two common
examples include image rendering (e.g., sampling of random textures from a prior
texture model) and further analysis (e.g., sampling of random three-dimensional
porous media for further porosity or groundwater-flow analysis).

Fig. 2.7. Dynamic estimation can be broken into three inter-related problems: smoothing,
filtering, and prediction, depending on whether the value to be estimated lies within, at the
end, or after the end of the available measurements, respectively.



2.5.3 Estimation 

Estimation is the solving of an inverse problem, the production of estimates of the 
state z underlying a system based on the measurements m. 

In the case of the Static Problem we seek estimates z which are simultaneously 
consistent, according to some criterion, with the prior model (2.60) and with the 
measurement model (2.61). 

In the analogous case of the Dynamic Problem we seek estimates $\hat{z}(t)$ which are
simultaneously consistent with the initial prior $P_0$, the dynamics (2.70), and the
measurements (2.72). Dynamic estimation is frequently divided into the three related
problems, illustrated in Figure 2.7:

1. Filtering: estimate z(t) based on measurements m(s), $0 \le s \le t$.

2. Smoothing: estimate z(t) based on measurements m(s), $0 \le s \le t + \tau$.

3. Prediction: estimate z(t) based on measurements m(s), $0 \le s \le t - \tau$.

That is, the problem depends on the distribution of available measurements relative 
to the quantity being estimated. The solution to the above three is very similar, so we 
normally do not concern ourselves with the distinction between them and generally 
just refer to estimation. 

A large number of different estimators have been proposed, for which a brief sum- 
mary follows. A more detailed derivation and development of those estimators used 
throughout this book may be found in Chapter 3 and in [248, 284].

Bayesian Estimators

The unknown z is random for which prior information (possibly mean and covariance,
higher-order statistics, or the full PDF) is available.

Bayesian Estimator (BE): the most general estimator,

$$\hat{z}_{BE} = \arg_{\hat{z}} \min \left\{ E\left[ C(z, \hat{z}) \,|\, m \right] \right\}, \tag{2.79}$$

finds estimates to minimize a specified cost function C().

Bayesian Least-Squares Estimator (BLSE): a widely-used special case of
the above, and normally much more tractable, is to choose a quadratic form for
C():

$$\hat{z}_{BLSE} = \arg_{\hat{z}} \min \left\{ E\left[ (z - \hat{z})^T (z - \hat{z}) \,|\, m \right] \right\}. \tag{2.80}$$

Linear Least-Squares Estimator (LLSE): also known as the Best Linear Unbiased
Estimator (BLUE), uses the same quadratic criterion as for the BLSE, except
that the estimator is required to be a linear function

$$\hat{z}_{LLSE} = Am + b \tag{2.81}$$

of the measurements. That is, the effective LLSE criterion is

$$\hat{z}_{LLSE} = \hat{A} m + \hat{b} \qquad [\hat{A}, \hat{b}] = \arg_{[A, b]} \min \left\{ E\left[ (z - Am - b)^T (z - Am - b) \right] \right\}. \tag{2.82}$$

Maximum a Posteriori (MAP): the estimate is chosen to maximize the posterior
probability density

$$\hat{z}_{MAP} = \arg_z \max\, p(z|m) = \arg_z \max\, p(m|z)\, p(z), \tag{2.83}$$

where the latter form, derived via Bayes' rule (B.21), is normally the more convenient,
because it expresses the estimator in terms of known quantities: the prior
model p(z) and the measurement model p(m|z).

Kriging: a long-established technique of spatial estimation [75, 227, 291], the
method is similar to the LLSE and BLUE, in that an estimator is sought, linear
in the measurements

$$\hat{z} = Am + b \tag{2.84}$$

such that

$$E[z - \hat{z}] = 0 \qquad \operatorname{var}(\hat{z} - z) \text{ is minimized}, \tag{2.85}$$

where the prior spatial model for z is specified through its variogram

$$\gamma(\delta) = \tfrac{1}{2} \operatorname{var}(z_1 - z_2), \tag{2.86}$$

where $z_1$ and $z_2$ are separated, spatially, by an amount $\delta$.

It is significant to note that the kriging variogram encodes only a knowledge of
the statistics of differences. That is, there is absolutely no notion of prior mean or
variance.

It is significant to note that for Gaussian problems (that is, those in which z and m are
jointly Gaussian), the MAP and BLSE estimators are linear, and in fact are identical
to the LLSE.

Since the LLSE is the only estimator requiring no information other than the second-order
statistics of the estimation problem, it is a common choice if the known statistics
are limited. Even if the statistics are known in detail, and are non-Gaussian,
the LLSE may still be chosen, at least as a first step, given the alternative of highly
complex, normally nonlinear estimators BE, BLSE, and MAP.

Example 2.9: State Estimation and Sampling

Building on Example 2.3, let's consider the infinity of all possible lines (a first-order
problem, left) or parabolae (second-order, right):

[Figure: random lines (left) and random parabolae (right).]

The above are essentially prior samples: random samples unconstrained by measurements.
A measurement is just a constraint, asserting a preference for certain
curves over others:

[Figure: One Measurement. Both problems are ill-posed (uniqueness fails).]

A single measurement is unable to constrain a line or a parabola, so uniqueness
fails, and the dashed line plots the minimum-norm solution of (2.36). With two
measurements the first-order problem is well-posed, however because the measurements
are not exact the posterior samples still show some variability. For the
second-order problem uniqueness still fails:

[Figure: Two Measurements.]

At three measurements the first-order problem is overconstrained (existence fails),
still allowing a least-squares estimate to be computed from (2.35), whereas the
second-order problem is now well-posed:

[Figure: Three Measurements. First-order: ill-posed, existence fails ($\hat{z}_{\text{LeastSq.}}$). Second-order: well-posed ($\hat{z}$).]

We clearly see measurements as constraining the statistics of the problem, such
that more measurements increasingly concentrate the posterior samples near the
estimates.

The elegance of the Bayesian formulation is that the preceding distinctions between
minimum-norm or least-squares solutions are unimportant, and the estimates
$\hat{z}$ (dashed curve in each panel) are calculated the same way in each case.

To better understand the calculations in this example the reader may wish to follow
up with Example 3.1. Two parallel examples, showing ensembles of possible
curves as a way of visualizing statistics, are shown in Example 3.4 on page 75 for
the static problem, and Example 4.1 on page 92 for the dynamic problems.

Non-Bayesian Estimators

The unknown z is not random, and no statistics of z are used.

Least Squares (LS): the measurements are modelled to be a noisy function of
the unknowns

$$m = f(z) + v, \tag{2.87}$$

thus the unknowns are chosen to minimize the squared error in the measurements:

$$\hat{z}_{LS} = \arg_z \min\, (m - f(z))^T (m - f(z)). \tag{2.88}$$

Linear Least Squares (LLS): given the linear measurements

$$m = Cz + v, \tag{2.89}$$

the unknowns are chosen to minimize the squared error

$$\hat{z}_{LLS} = \arg_z \min\, (m - Cz)^T (m - Cz), \tag{2.90}$$

leading to an estimator $\hat{z}_{LLS}$ which is a linear function of m.

Weighted Linear Least Squares (WLS): as before, however the measurement
errors are not all treated equally; instead, a weighting matrix W controls
the relative degree to which measurement errors are tolerated:

$$\hat{z}_{WLS} = \arg_z \min\, (m - Cz)^T W (m - Cz). \tag{2.91}$$

Total Least Squares (TLS): the error is modelled not only in the measurements
m, but also in the measurement model C [141]. That is, the measurements are
modelled as

$$m = (C + E) z + v, \tag{2.92}$$

where E is the unknown perturbation in the measurement model. Total least
squares then seeks to minimize the norm of the concatenated error $\| [v \;\; E] \|_2$,
which is found via the singular value decomposition of matrix $[m \;\; {-C}]$.

Maximum Likelihood (ML): the non-Bayesian analogue of MAP, the estimates
are chosen to make the observed measurements most probable:

$$\hat{z}_{ML} = \arg_z \max\, p(m|z). \tag{2.93}$$

Although we are primarily concerned with Bayesian methods of estimation, we shall
see (Section 3.2.4) that there is a certain duality between the Bayesian and non-Bayesian
methods. As with the Bayesian methods, in the case where the measurement
model is linear and the measurement noise statistics are Gaussian, then the ML
estimator is linear and equivalent to the LLS, WLS estimators (for certain choices of
weighting matrix and measurement error statistics).
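
The linear non-Bayesian estimators above are short enough to sketch directly. The following is an illustration under assumptions, not a reference implementation: the function names are hypothetical, C is assumed to have full column rank, and the total least-squares solution is taken from the right singular vector of $[C \;\; m]$ associated with the smallest singular value, which is one common way of organizing the SVD computation (the column ordering and sign differ from the $[m \;\; {-C}]$ convention mentioned above).

```python
import numpy as np

def lls(C, m):
    """Linear least squares (2.90)."""
    z, *_ = np.linalg.lstsq(C, m, rcond=None)
    return z

def wls(C, m, W):
    """Weighted linear least squares (2.91), via the weighted normal equations."""
    return np.linalg.solve(C.T @ W @ C, C.T @ W @ m)

def tls(C, m):
    """Total least squares for a single measurement vector, via the SVD of [C m]."""
    p = C.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([C, m]))
    v = Vt[-1]                 # right singular vector of the smallest singular value
    return -v[:p] / v[p]       # scale so the last component equals -1

rng = np.random.default_rng(0)
C = rng.standard_normal((8, 2))
z_true = np.array([1.5, -0.5])
m_exact = C @ z_true                              # noise-free measurements
m_noisy = m_exact + 0.1 * rng.standard_normal(8)

# Choosing W = R^{-1} weights each measurement by its (hypothetical) reliability.
R = np.diag(np.linspace(0.5, 2.0, 8))
print(lls(C, m_noisy), wls(C, m_noisy, np.linalg.inv(R)), tls(C, m_noisy))
```

With W equal to the identity, `wls` reduces to `lls`; with noise-free data all three estimators recover `z_true`.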



Fig. 2.8. An illustration of posterior sampling: only a sparse subset (middle columns) of a
texture are measured (left), leading to estimates (middle) which lack the statistical variability
of the original texture. A posterior sample (right) is consistent with the measurements, but
where not constrained by the measurements follows the statistical pattern of the prior model.

2.5.4 Posterior Sampling

In Section 2.5.2 we discussed the sampling of a random vector

$$z \sim p(z) \tag{2.94}$$

such that the independent random samples $z_1, \ldots, z_q$ represent "typical" samples of
the prior, without any measurement constraints.

For estimation, we seek the unique, optimum estimate $\hat{z}$ which balances the constraints
of a prior model p(z) and measurements m, for example

$$\hat{z}_{MAP} = \arg_z \max\, p(m|z)\, p(z). \tag{2.95}$$

It is important to realize that the estimate $\hat{z}$ does not necessarily look like a typical or
natural sample of the random field, especially if the measurements are sparse. That is,
although the estimate is optimum under some criterion, it may not represent a typical
or representative sample of the system being studied. This is precisely because the
most likely (MAP) sample is not typical: for a Gaussian random variable z of mean
$\mu$, the most likely value of z is $\mu$, however we do not expect a typical sample of z to
precisely equal $\mu$!

Instead, to find a typical random field, subject to the constraints of both the prior and
measurements, requires that we draw a random sample from the posterior distribution,
a much more subtle and difficult problem than estimation. That is, we want to
draw a sample from the conditional distribution

$$(z|m) \sim p(z|m), \tag{2.96}$$

as is illustrated in Figure 2.8 and Example 2.9.

As in the case of prior sampling, samples drawn from the posterior may be desired 
for purposes of visualization, further analysis, or Monte Carlo studies. 
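
For a small linear-Gaussian problem the distinction between an estimate and a posterior sample can be made concrete. The sketch below assumes a zero-mean Gaussian prior and uses the standard Gaussian conditioning formulas for the posterior mean and covariance, which are not derived in this section; the covariance model, measurement locations, and values are made up. It is a direct dense computation, suitable only for small problems.

```python
import numpy as np

def posterior_moments(P, C, R, m):
    """Posterior mean and covariance of z given m = C z + v, z ~ N(0, P), v ~ N(0, R)."""
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)   # estimator gain
    return K @ m, P - K @ C @ P

rng = np.random.default_rng(0)
n = 50
x = np.arange(n)
P = np.exp(-np.abs(x[:, None] - x[None, :]) / 5.0)   # hypothetical prior covariance
idx = np.array([5, 25, 45])                          # sparse measurement locations
C = np.eye(n)[idx]
R = 1e-4 * np.eye(len(idx))                          # nearly exact measurements
m = np.array([1.0, -1.0, 0.5])

z_hat, P_post = posterior_moments(P, C, R, m)

# Posterior sample: estimate plus a random draw from the posterior covariance.
# The symmetrization and small jitter only guard the factorization against round-off.
G = np.linalg.cholesky((P_post + P_post.T) / 2 + 1e-10 * np.eye(n))
z_sample = z_hat + G @ rng.standard_normal(n)

print(np.abs(z_sample[idx] - m).max())               # small: the sample honours the data
print(np.diag(P_post).min(), np.diag(P_post).max())  # uncertainty grows away from the data
```

The estimate `z_hat` is smooth and decays towards the prior mean between measurements, whereas `z_sample` retains the variability of the prior wherever the measurements do not constrain it.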

2.5.5 Parameter Estimation 

In some contexts the prior model may not be completely specified. Its form (e.g.,
Gaussian) may be known, but certain details (such as the variances or correlations
between random variables, etc.) are not, and need to be estimated. That is, we have
vectors of unknowns $\theta_P$, $\theta_M$ in the prior and measurement models, respectively:

$$z \sim p(z|\theta_P) \qquad m \sim p(m|z, \theta_M). \tag{2.97}$$

In virtually all cases, the parameters are treated as unknowns (rather than random),
and so a non-Bayesian approach, typically Maximum Likelihood, is used. That is,
we select those model parameters which are most consistent with (maximize the
likelihood of) the measurements:

$$\hat{\theta}_P, \hat{\theta}_M = \arg_\theta \max\, p(m|\theta_P, \theta_M). \tag{2.98}$$

A full treatment of model parameter estimation is really the domain of system identification
[212], which is well beyond the scope of this book. We are, however, interested
in model identification in certain contexts where we propose a specific spatial
model, such as Markov random fields in Chapter 6 or the marching and multiscale
models in Chapter 10.
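
As a toy illustration of (2.98), consider the simplest possible case, which is an assumption made here for the sake of a runnable sketch rather than a model from the text: independent scalar measurements $m_i = z_i + v_i$ with $z_i \sim \mathcal{N}(0, \theta)$ and known noise variance, so that the single unknown parameter is the prior variance $\theta$ and $m_i \sim \mathcal{N}(0, \theta + \sigma^2)$. The likelihood is maximized by a simple grid search.

```python
import numpy as np

def log_likelihood(m, theta, sigma2):
    """Log-likelihood of i.i.d. measurements m_i ~ N(0, theta + sigma2)."""
    s = theta + sigma2
    return -0.5 * np.sum(np.log(2 * np.pi * s) + m ** 2 / s)

rng = np.random.default_rng(0)
theta_true, sigma2 = 2.0, 0.5          # hypothetical prior and noise variances
m = rng.normal(0.0, np.sqrt(theta_true + sigma2), size=5000)

# Maximum likelihood by exhaustive search over a grid of candidate values.
grid = np.linspace(0.01, 5.0, 500)
theta_hat = grid[np.argmax([log_likelihood(m, th, sigma2) for th in grid])]

# For this toy model the maximizer is also available in closed form.
theta_closed = np.mean(m ** 2) - sigma2
print(theta_hat, theta_closed)
```

For spatial models the likelihood involves the determinant and inverse of a large covariance matrix, so practical parameter estimation relies on the structured models developed in later chapters rather than on a search of this kind.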



Application 2: Ocean Acoustic Tomography 



There are many circumstances in which we measure integrals through absorbing 
media, such as in medical imaging (as was illustrated in Figure 2.1 for X-rays); the 
process of inverting such measurements is known as tomography. 

One recent and very novel tomographic problem is the acoustic "imaging" of the 
ocean [238]. Amazingly, low-frequency sound waves can propagate for thousands of 
kilometers. The key reason for the effective sound propagation is that the ocean acts 
as an acoustic waveguide: as shown in Figure 2.9, at a depth of approximately 1 km 
the sound speed is at a minimum, meaning that propagating wavefronts will bend 
towards 1 km depth, rather than hitting the surface or being absorbed at the ocean 
bottom. 

Figure 2.9 sketches three different acoustic waves, all leaving from the same source
and being observed at a common receiver. Observe what is happening: the different
waves "observe" different parts of the ocean, such that the high-amplitude wave measures
a wide range of depths, whereas the low-amplitude wave is mostly sensitive to
the ocean at a depth of 1 km.

Fig. 2.9. Because the speed of sound in water has a minimum at an intermediate depth, left,
the ocean acts as an acoustic waveguide, right.

What can these different sound waves tell us? Each sound wave has a different arrival
time, where the arrival time is a function of the average (or integrated) sound speed
over the path of the wave. The sound speed is primarily a function of depth (thus
A arrives first, then B, then C), but also salinity, temperature, and ocean currents.
The effect of salinity is slight, and the effect of depth can be accounted for via a
sufficiently accurate ray-tracing along the lines of Figure 2.9, leaving the effects of
temperature and currents to be inferred.

We now have two different inverse problems to solve, as illustrated in Figure 2.10: 



1. Acquire the tomographic line-integral measurements:

The measured arrival times depend on the average sound speed along the highly 
complex wavefront trajectories. To convert into a simpler, standard tomographic 
problem involving straight-line integrals, we wish to divide the ocean into layers, 
as shown in Figure 2.10, and to solve an inverse problem to find the average sound 
speed over each layer. 

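That is, if ray j has an observed average sound speed of $m_j$, and by simulating
the ray paths we know that ray j spends a fraction $a_{ji}$ in depth layer i, then we
wish to solve

$$\begin{bmatrix} m_1 \\ m_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots \\ a_{21} & a_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ \vdots \end{bmatrix} \qquad (2.99)$$

for the average sound speed $s_i$ of each layer i.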



In practice, we would probably have a number of acoustic receivers spaced at 
different depths.

Fig. 2.10. Two inverse problems need to be solved. A 2D space-depth problem, left, inverts
the acoustic arrival measurements to average sound speed by layer. A second 3D space-space- 
depth problem, right, solves for the three-dimensional pattern of sound speed based on many 
source-receiver pairs, in order to infer ocean temperature and currents. 



2. Invert the 3D tomographic problem: 

As shown in Figure 2.10, we may have a number of source-receiver pairs, each of 
which leads to a 2D inverse problem in (2.99). We now have a three-dimensional 
inverse problem, with sources and receivers distributed spatially, and with layers 
over depth. Although such problems are widely studied in the medical imaging 
literature, the ocean acoustic problem is far more sparse and irregular, making 
many medical approaches inapplicable. 



Summary 

An inverse problem exists any time we wish to infer one or more unknowns from 
a set of measurements. The forward problem proceeds from the unknowns to the 
measurements; the inverse problem is the inference from the measurements to the 
unknowns. The inverse problem is said to be well-posed if the following three criteria 
are satisfied: 

Existence: Every observation m has at least one corresponding value of z. 
Uniqueness: For every observation m, the solution for z is unique. 
Continuity: The dependence of the solution z on m is continuous. 



However, even problems which are well-posed may fail to be solvable numerically 
because of poor conditioning, in which case we regularize the problem: 

$$\hat{z}(m, \lambda) = \arg\min_z \left\{ \|m - Cz\|_{R^{-1}} + \lambda \|Lz\| \right\}, \qquad (2.100)$$

where $\lambda$ is the regularization constant, which may be learned by validation, and
where L is a set of assumptions or constraints on z.



For Further Study 



To look further into inverse problems, the papers by Bertero et al. [27] and by Geman
[129] are highly recommended, as they combine questions of inverse problems and 
low-level computer vision in the former paper, and random fields in the latter. 

To read further in the area of regularization, three papers are recommended [27, 189, 
304], in addition to the classic text [308] by Tikhonov and Arsenin. 

The references, sorted by category starting on page 433, give additional text and 
paper citations.

Problem 2.1: Uniqueness and Ill-Posed Problems

(a) Define "Uniqueness" 

(b) Given matrix C = [1 2]:

• Compute Rank(C) 

• Compute or describe Ra(C), Nu(C) 

• Does the inverse problem m = Cz satisfy uniqueness? 

(c) In general, for C a k x n matrix, prove that uniqueness fails for k < n. 

Problem 2.2: Existence and Ill-Posed Problems 

(a) Define "Existence" 

(b) Given the matrix 

C = 

• Compute Rank(C) 

• Compute or describe Ra(C), Nu(C) 

• Does the inverse problem m = Cz satisfy existence? 

(c) In general, for C a k x n matrix, prove that existence fails for k > n.

Problem 2.3: Number of Solutions

For a linear inverse problem, prove that the number of solutions must either be 
zero, one, or infinitely many. 

Problem 2.4: Numerical Conditioning 

We want to look at matrix conditioning, as a function of statistics and matrix 
size. We imagine that we have a vector x having n elements, such that x has an 
associated n x n covariance P. 

We can study the numerical behaviour of P in a variety of ways; here we consider 
two: 

1. The condition number $\kappa(P)$, evaluated in MATLAB using the function cond

2. The degree of error in matrix inversion. Let

$$e = \max(\mathrm{diag}(P \cdot P^{-1})) - \min(\mathrm{diag}(P \cdot P^{-1})).$$

In the ideal case, $P \cdot P^{-1} = I$, in which case e = 0. Larger values of e imply
greater difficulties in computing accurate inverses, a surrogate for conditioning.

In a numerical mathematical package, such as MATLAB, prepare the follow- 
ing: 

(a) Test conditioning as a function of the degree of correlation:
Create matrices $A_1, A_2, \ldots$ where

$A_n$ is 20 x 20, $(A_n)_{ij} = \exp(-|i - j| \cdot 0.3^n)$.

(b) Test conditioning as a function of matrix size:
Create exponentially-correlated matrices $B_1, \ldots, B_{50}$ where

$B_n$ is n x n, $(B_n)_{ij} = \exp\left(-\frac{|i - j|}{3}\right)$.

(c) As in (b), with the same correlation length (of 3), but with different statistics:
Create Gaussian-correlated matrices $C_1, \ldots, C_{50}$ where

$C_n$ is n x n, $(C_n)_{ij} = \exp\left(-\left(\frac{|i - j|}{3}\right)^2\right)$.

Plot the condition number $\kappa$ and inversion inconsistency e for each of $A_n$, $B_n$, $C_n$
as a function of n. Study the behaviour of the plots.

Problem 2.5: Open-Ended Real-Data Problem — Image Processing

Most numerical mathematics programs offer a variety of regularization methods
and inverse problems. Both MATLAB and Octave, a free open-source alternative,
offer most of the following routines:

medfilt2          Two-dimensional median filtering
filter2           Two-dimensional image filtering
wiener2           The two-dimensional Wiener filter for denoising
iradon            The inverse Radon transform, for medical tomography
deconvwnr         The two-dimensional Wiener filter for deblurring
deconvreg         Image deblurring using a regularized filter
ipexregularized   Image deblurring example

Acquire some number of images, of your own or from the Internet, and experi- 
ment with one or more of the above functions: 

(a) For denoising, report your results as a function of the added noise variance, 
and the type of noise (Gaussian, uniform, salt-and-pepper). 

(b) For deblurring, report your results as a function of the degree of blur. 



Static Estimation and Sampling 



This chapter derives the two fundamental linear estimators: 

Section 3.1: The non-Bayesian linear least-squares estimator 
Section 3.2: The Bayesian linear least-squares estimator 

Throughout this chapter we concern ourselves with the derivation of algebraic
estimators $\hat{z}$ for some random vector z, but ignoring the issues of what z represents,
or any concerns regarding its size, both of which are extremely important and are
examined closely beginning with Chapter 5.

The question of how to actually represent a multidimensional field in a vector is the 
topic of Chapter 5, and questions of computational tractability and simplification are 
addressed starting in Chapter 9. 

In this chapter we consider z to be static, that is, where z is a single unknown vector 
which has no time dependence or evolution over time. Issues of time dependence are 
considered in Chapter 4. 

Even within this static framework there are a number of variations, depending on 
what we assume or know about the measurement errors (and their statistics), and 
similarly whether z is just unknown, or subject to the constraints and assumption of 
some prior statistical model.

3.1 Non-Bayesian Estimation

Linear Least Squares

We'll start with the simplest possible linear inverse problem, where we wish to infer 
unknowns z from measurements m: 

m = Cz. (3.1) 

In many cases this problem will be ill-posed: 

If $m \in \mathrm{Ra}(C)$ then existence is satisfied, but uniqueness may not be.
If $m \notin \mathrm{Ra}(C)$ then existence fails.
If $\mathrm{Nu}(C) \neq \{0\}$ then uniqueness fails.

Only if C is square and invertible is the problem well-posed. 

The above pessimistic summary, that only problems in which C is invertible may be 
solved, is due to limitations in our problem formulation (3.1). We rarely expect C 
to be invertible; indeed, normally we have more measurements than unknowns, in 
which case C is a tall rectangular matrix. Furthermore we don't actually believe that 
m = Cz; rather, measurements are invariably corrupted by some inaccuracy or noise v:

$$m = Cz + v, \qquad (3.2)$$

in which case we are interested in finding $\hat{z}$ such that $C\hat{z}$ is close to m, but not
necessarily equal. We often choose a least-squares criterion (2.90): if we define the
estimation error e as

$$e = m - Cz, \qquad (3.3)$$

then the least-squares goal is to find $\hat{z}$ by minimizing $e^T e$. There are many ways of
deriving the solution to this problem; we consider two.

Algebraic Derivation: The squared error to minimize is

$$e^T e = (m - Cz)^T (m - Cz), \qquad (3.4)$$

thus the vector derivative (Appendix A.6) is

$$\frac{d\, e^T e}{dz} = \frac{d}{dz} \left[ m^T m - 2 m^T C z + z^T C^T C z \right] \qquad (3.5)$$

$$= 0^T - 2 m^T C + 2 z^T C^T C. \qquad (3.6)$$

Fig. 3.1. The orthogonality principle: Existence fails here, since m lies outside of Ra(C).
To solve this ill-posed problem, least-squares seeks that point in Ra(C) which is closest to m.
This closest point must be $\hat{m}$, the point where the error $m - \hat{m}$ is at right angles
(orthogonal) to Ra(C).



So $e^T e$ is minimized when $\hat{z}^T C^T C = m^T C$ or, taking the transpose, when
$C^T C \hat{z} = C^T m$. In the event that $C^T C$ is invertible, meaning that C has full
column rank, we obtain the well-known least-squares estimator (see footnote 1)

$$\hat{z} = (C^T C)^{-1} C^T m. \qquad (3.7)$$



Footnote 1: It should be mentioned that the estimator in (3.7), although correct, should
not actually be solved using an explicit matrix inverse, especially for large problems.
This is discussed much later, in Section 9.1.

Geometric Derivation (The Orthogonality Principle): We can reinterpret the problem
in a more geometric sense. Since Cz can explore all of Ra(C), finding $\hat{z}$ to minimize
$\|m - Cz\|$ is equivalent to finding the element in Ra(C) closest to m. That is, which
vector $\tilde{m} = Cz \in \mathrm{Ra}(C)$ minimizes

$$\|m - Cz\| = \|m - \tilde{m}\| = (m - \tilde{m})^T (m - \tilde{m})? \qquad (3.8)$$

Let $\hat{m}$ be the unique element of Ra(C) such that the error $m - \hat{m}$ is orthogonal
to Ra(C); that is, that

$$(m - \hat{m}) \perp \mathrm{Ra}(C) \;\Longleftrightarrow\; (m - \hat{m})^T a = 0 \quad \forall a \in \mathrm{Ra}(C). \qquad (3.9)$$

This construction decomposes the error

$$(m - Cz) = \underbrace{(m - \hat{m})}_{e_1} + \underbrace{(\hat{m} - Cz)}_{e_2} \qquad (3.10)$$

into two orthogonal components $e_1 \perp e_2$, such that by Pythagoras

$$\underbrace{\|m - Cz\|}_{\text{Error to Minimize}} = \underbrace{\|m - \hat{m}\|}_{\text{Fixed}} + \underbrace{\|\hat{m} - Cz\|}_{\text{Set to Zero}}. \qquad (3.11)$$

Of the two terms on the right-hand side, the first is not under our control, so the
best we can do is to set the second term to zero, which is accomplished by setting

$$C\hat{z} = \hat{m}.$$

Finally, how do we find $\hat{m}$? The projection $\hat{m}$ must satisfy two criteria:

1. $\hat{m}$ must be in the subspace Ra(C).

2. The difference $m - \hat{m}$ must be orthogonal to Ra(C).

We claim that

$$\hat{m} = C (C^T C)^{-1} C^T m \qquad (3.12)$$

is the projection of m onto the range space Ra(C), which can be verified:

1. $\hat{m} = C (C^T C)^{-1} C^T m = C[\,\cdots] \in \mathrm{Ra}(C)$,

2. $C^T (m - \hat{m}) = C^T m - (C^T C)(C^T C)^{-1} C^T m = 0$.

So we have identified $\hat{m} = C (C^T C)^{-1} C^T m$ as the optimum element in the
subspace of C, minimizing the error with respect to m; thus we identify

$$C\hat{z} = \hat{m} = C (C^T C)^{-1} C^T m \;\longrightarrow\; \hat{z} = (C^T C)^{-1} C^T m \qquad (3.13)$$

as the selected estimator.
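As an illustration of both derivations, the following is a minimal NumPy sketch, assuming a small made-up problem (the matrix C, the true z, and the noise level are hypothetical, chosen only for this example). It solves the normal equations behind (3.7) without forming an explicit inverse, and then checks the orthogonality condition: the residual $m - C\hat{z}$ should be orthogonal to every column of C.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: k = 6 measurements, n = 2 unknowns (tall C).
C = np.column_stack([np.arange(6.0), np.ones(6)])
z_true = np.array([2.0, -1.0])
m = C @ z_true + 0.1 * rng.standard_normal(6)

# Least-squares estimate (3.7), obtained by solving the normal equations
# C^T C z = C^T m rather than computing an explicit matrix inverse.
z_hat = np.linalg.solve(C.T @ C, C.T @ m)

# The same estimate from NumPy's general least-squares routine.
z_lstsq, *_ = np.linalg.lstsq(C, m, rcond=None)

# Projection of m onto Ra(C), as in (3.12), and the remaining error.
m_hat = C @ z_hat
residual = m - m_hat

print("estimates agree:      ", np.allclose(z_hat, z_lstsq))
print("residual orthogonal:  ", np.allclose(C.T @ residual, 0.0, atol=1e-9))
```

Any other choice of z can only increase the error, since it adds a component inside Ra(C) on top of the fixed residual.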

The computation of $\hat{z} = (C^T C)^{-1} C^T m$ requires the invertibility of $(C^T C)$, which
is true if and only if C has full column rank. This assumption may fail under the
following circumstances:

• Suppose C is k x n, k < n:

We have fewer equations than unknowns and $(C^T C)$ is singular. The solution for
z is not unique, and additional constraints (prior model, regularization constraints)
are required.

• Suppose C is k x n, k ≥ n:

We have at least as many equations as unknowns; $(C^T C)$ will be singular if the
columns of C are linearly dependent, meaning that some portion of z is unobserved
(C has a nullspace).

Much of the above mathematics is represented elegantly in the theory of pseudoinverse
matrices [4], which is summarized in Appendix A.9. In particular if C is nonsingular
then the pseudoinverse $C^+$ of C is the matrix inverse $C^+ = C^{-1}$, generalizing
to the projection operator $C^+ = (C^T C)^{-1} C^T$ in the special case that C has
full column rank. However unlike the projector, $C^+$ exists for all real C, whether
rank deficient or not.

Indeed, given observation matrix C, the estimator

$$\hat{z} = C^+ m \qquad (3.14)$$

based on the Moore-Penrose pseudoinverse $C^+$ [4], produces the minimum-norm
least-squares solution for z. That is, of all of the (possibly many) z minimizing

$$(m - Cz)^T (m - Cz), \qquad (3.15)$$

$C^+ m$ chooses that unique vector which minimizes $z^T z$.
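The minimum-norm property can be seen in a small sketch, assuming a deliberately rank-deficient C (a hypothetical matrix with two identical columns, so that only the sum of the two unknowns is observed). NumPy's `np.linalg.pinv` computes the Moore-Penrose pseudoinverse.

```python
import numpy as np

# Hypothetical rank-deficient problem: the two columns of C are identical,
# so only z1 + z2 is observed and C^T C is singular.
C = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0]])
m = np.array([2.0, 2.0, 2.0])

# Pseudoinverse estimate (3.14).
z_pinv = np.linalg.pinv(C) @ m

# A second least-squares solution: add a nullspace component of C.
z_other = z_pinv + np.array([1.0, -1.0])

print("pinv solution:          ", z_pinv)
print("same fit to the data:   ", np.allclose(C @ z_pinv, C @ z_other))
print("norms (pinv, other):    ", np.linalg.norm(z_pinv), np.linalg.norm(z_other))
```

Both vectors reproduce the measurements equally well; the pseudoinverse selects the one of smaller norm.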

Weighted Least Squares 

Our previous derivations considered all of the measurement errors as equivalent,
equally penalizing the error in each dimension. In practice we may prefer a weighted
criterion, minimizing the least-squares weighted error (Fe):

$$(Fe)^T (Fe) = e^T W e = \|m - Cz\|_W \qquad (3.22)$$

for some appropriate choice of $W = F^T F$. If we transform the error by the weight
matrix F, writing $\bar{e} = Fe$, $\bar{m} = Fm$, $\bar{C} = FC$, then

$$e = m - Cz \;\Longrightarrow\; Fe = Fm - FCz \;\Longleftrightarrow\; \bar{e} = \bar{m} - \bar{C}z. \qquad (3.23)$$

As a consequence, the error criterion (3.22) can be similarly transformed as

$$e^T W e = (Fe)^T (Fe) = \bar{e}^T \bar{e}. \qquad (3.24)$$

That is, we have converted the problem to regular least squares, to which we have
already derived the answer in (3.7), (3.13):

$$\hat{z} = (\bar{C}^T \bar{C})^{-1} \bar{C}^T \bar{m} \qquad (3.25)$$

$$= (C^T F^T \cdot F C)^{-1} (C^T F^T) \cdot F m = (C^T W C)^{-1} C^T W m, \qquad (3.26)$$

where if W is nonsingular, the conditions for the invertibility of $(C^T W C)$ are the
same as those for $(C^T C)$.
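The transformation argument can be checked numerically. The sketch below assumes a hypothetical diagonal weight matrix W and takes F from a Cholesky factorization (one possible choice satisfying $W = F^T F$; any other factor would do). It compares ordinary least squares on the transformed problem with the direct weighted formula (3.26).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical problem: 8 measurements of a line with slope 3 and offset 0.5.
C = np.column_stack([np.linspace(0.0, 1.0, 8), np.ones(8)])
m = C @ np.array([3.0, 0.5]) + 0.05 * rng.standard_normal(8)

# Hypothetical weights: the first four measurements are trusted more.
W = np.diag([10.0] * 4 + [1.0] * 4)

# Factor W = F^T F. np.linalg.cholesky returns lower-triangular L with
# W = L L^T, so F = L^T satisfies F^T F = W.
F = np.linalg.cholesky(W).T

# Regular least squares on the transformed problem (3.25).
z_transformed, *_ = np.linalg.lstsq(F @ C, F @ m, rcond=None)

# Direct weighted least-squares formula (3.26).
z_direct = np.linalg.solve(C.T @ W @ C, C.T @ W @ m)

print("transformed and direct estimates agree:",
      np.allclose(z_transformed, z_direct))
```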



Example 3.1: Linear Regression is Least Squares

Suppose we are given points $\{x_i, y_i\}$.

We wish to do linear regression, meaning that we assert a model

$$y_i = a x_i + b + v_i, \qquad (3.16)$$

where $v_i$ is some error. Thus we have unknowns z and measurements m:

$$z = \begin{bmatrix} a \\ b \end{bmatrix} \qquad m = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix} \qquad (3.17)$$

leading to the usual formulation

$$m = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_k & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} + v = Cz + v. \qquad (3.18)$$

We know the usual least-squares solution to this canonical problem to be

$$\hat{z} = (C^T C)^{-1} C^T m. \qquad (3.19)$$

Substituting (3.18) into (3.19) gives

$$\hat{z} = \begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix} = \begin{bmatrix} \sum_i x_i^2 & \sum_i x_i \\ \sum_i x_i & k \end{bmatrix}^{-1} \begin{bmatrix} \sum_i x_i y_i \\ \sum_i y_i \end{bmatrix} \qquad (3.20)$$

which is the familiar solution to linear regression.

Once we recognize and understand the above development, linear regression is
easily generalized to other cases, such as

$$y_i = a x_i^2 + b x_i + v_i \qquad y_i = a e^{x_i} + v_i \qquad y_i = a \sin x_i + b \cos 2x_i + v_i \qquad (3.21)$$

$y_i$ does not need to be linear in $x_i$, however it does need to be linear in the
unknowns a, b. For example, the generalized regression $y_i = \sin(a x_i) + v_i$ is a much
more difficult problem.

Constrained or Regularized Least-Squares

The reader may wonder regarding the relevance of the discussions of least-squares
solutions to m = Cz + v, given that much of Chapter 2 was devoted to demonstrating
the ill-posedness of most such problems and, in Section 2.4, the discussion
of constraints and regularization schemes to permit a solution. In particular, (2.39)
formulated the following constrained least-squares criterion:



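$$\min_z \left\{ \|m - Cz\|_{R^{-1}} + \lambda \|Lz\| \right\}. \qquad (3.27)$$

This criterion is, however, just a variant of weighted least-squares:

$$\|m - Cz\|_{R^{-1}} + \lambda \|Lz\| = (m - Cz)^T R^{-1} (m - Cz) + \lambda (Lz)^T (Lz)$$

$$= \left( \begin{bmatrix} m \\ 0 \end{bmatrix} - \begin{bmatrix} C \\ L \end{bmatrix} z \right)^T \begin{bmatrix} R^{-1} & 0 \\ 0 & \lambda I \end{bmatrix} \left( \begin{bmatrix} m \\ 0 \end{bmatrix} - \begin{bmatrix} C \\ L \end{bmatrix} z \right) \qquad (3.28)$$

$$= (\bar{m} - \bar{C}z)^T \bar{R}^{-1} (\bar{m} - \bar{C}z) = \|\bar{m} - \bar{C}z\|_{\bar{R}^{-1}},$$

where $\bar{m}$, $\bar{C}$, and $\bar{R}^{-1}$ denote the stacked measurement vector, the stacked
observation matrix, and the block-diagonal weight matrix. That is, the regularization
constraints $\lambda \|Lz\|$ are algebraically equivalent to the inclusion of an additional
"measurement"

$$Lz = 0. \qquad (3.29)$$

Associating the above criterion $\|\bar{m} - \bar{C}z\|_{\bar{R}^{-1}} = \bar{e}^T \bar{R}^{-1} \bar{e}$ with the criterion $e^T W e$
from weighted least squares allows us to find the solution directly from the weighted
least-squares solution (3.26), with $W = \bar{R}^{-1}$:

$$\hat{z} = (\bar{C}^T \bar{R}^{-1} \bar{C})^{-1} \bar{C}^T \bar{R}^{-1} \bar{m} \qquad (3.30)$$

$$= \left( \begin{bmatrix} C \\ L \end{bmatrix}^T \begin{bmatrix} R^{-1} & 0 \\ 0 & \lambda I \end{bmatrix} \begin{bmatrix} C \\ L \end{bmatrix} \right)^{-1} \begin{bmatrix} C \\ L \end{bmatrix}^T \begin{bmatrix} R^{-1} & 0 \\ 0 & \lambda I \end{bmatrix} \begin{bmatrix} m \\ 0 \end{bmatrix} \qquad (3.31)$$

$$= (C^T R^{-1} C + \lambda L^T L)^{-1} C^T R^{-1} m. \qquad (3.32)$$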



This result, the estimator for regularized least- squares, is of key importance through- 
out the following chapters. 
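A short sketch may make the "additional measurement" interpretation concrete. It assumes a hypothetical underdetermined problem (four of ten elements observed) with a first-difference operator as L; these choices, along with the values of R and $\lambda$, are illustrative assumptions only. The regularized estimate is computed once from the closed form (3.32) and once as weighted least squares on the stacked system.

```python
import numpy as np

n = 10
lam = 1.0                                   # regularization constant (assumed)

# Hypothetical problem: observe elements 0, 3, 6, 9 of a 10-element z.
C = np.eye(n)[[0, 3, 6, 9]]
R = 0.01 * np.eye(4)
m = np.array([0.0, 1.0, 0.5, -0.5])

# First-difference constraint: row i of L computes z[i+1] - z[i].
L = np.diff(np.eye(n), axis=0)

Rinv = np.linalg.inv(R)

# Closed-form regularized least squares (3.32).
z_reg = np.linalg.solve(C.T @ Rinv @ C + lam * L.T @ L, C.T @ Rinv @ m)

# Equivalent weighted least squares on the stacked system, treating
# L z = 0 as an additional "measurement" with weight lam * I.
C_bar = np.vstack([C, L])
m_bar = np.concatenate([m, np.zeros(n - 1)])
W_bar = np.block([[Rinv, np.zeros((4, n - 1))],
                  [np.zeros((n - 1, 4)), lam * np.eye(n - 1)]])
z_stacked = np.linalg.solve(C_bar.T @ W_bar @ C_bar, C_bar.T @ W_bar @ m_bar)

print("C^T C rank (singular without regularization):",
      np.linalg.matrix_rank(C.T @ C))
print("closed form and stacked system agree:", np.allclose(z_reg, z_stacked))
```

Without the constraint term the normal equations are singular here, which is exactly the situation regularization is meant to repair.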



Maximum Likelihood 



A related approach is that of maximum likelihood, still treating z as unknown (i.e.,
non-Bayesian), but where we formulate the problem more explicitly using statistics.
Given measurements m with known Gaussian noise statistics

$$m = Cz + v \qquad v \sim \mathcal{N}(0, R), \qquad (3.33)$$

we seek the maximum likelihood estimate (2.93)

$$\hat{z}_{ML} = \arg\max_z \; p(m \mid z) \qquad (3.34)$$

$$= \arg\max_z \; \frac{1}{(2\pi)^{k/2} |R|^{1/2}} \exp\left( -\tfrac{1}{2} (m - Cz)^T R^{-1} (m - Cz) \right) \qquad (3.35)$$

$$= \arg\min_z \; (m - Cz)^T R^{-1} (m - Cz) \qquad (3.36)$$

$$= \arg\min_z \; \left( e^T R^{-1} e \right) \qquad (3.37)$$

which we immediately recognize as the weighted least-squares criterion (3.22),
where we identify $W = R^{-1}$. The solution follows from (3.26)

$$\hat{z} = (C^T R^{-1} C)^{-1} C^T R^{-1} m. \qquad (3.38)$$

Associating W with $R^{-1}$ is intuitive. A small value $R_{ii}$ implies a small variance for the
measurement noise $v_i$, meaning that we expect to find an estimate $\hat{z}$ which makes $e_i$ small.
Therefore we wish to assert an intolerance for large $e_i$, implying a large $W_{ii}$.

Thus we have derived the LLS, WLS, and ML estimators, and have seen the great 
deal which they have in common. It is important to keep in mind that the parallels 
with ML do assume Gaussianity, and do not hold for other statistics. 



3.2 Bayesian Estimation 



We move now to Bayesian estimation, in which our unknowns z are random:
statistical quantities subject to a prior model.

The Bayesian formulation leads to a few subtle changes, compared to the
non-Bayesian case. The error e is now the estimation error

$$e = \tilde{z} = \hat{z} - z, \qquad (3.39)$$

and the stochastic nature of z implies that e too is stochastic, so the least-squares
criterion is formulated as an expectation:

$$\hat{z} = \arg\min_{\hat{z}} E\left[ (\hat{z} - z)^T (\hat{z} - z) \mid m \right]. \qquad (3.40)$$

The orthogonality principle still applies, but now in a statistical form, rather than
strictly geometrical. That is, orthogonality states that the optimum estimator must
make the error uncorrelated (orthogonal) to any function of the measurements:

$$E\left[ (\hat{z} - z) f(m)^T \right] = 0 \quad \text{for any function } f. \qquad (3.41)$$

This makes sense: if there were any residual correlation remaining between the mea- 
surements and the estimation error, then we should be able to formulate a better 
estimator by taking advantage of this correlation.

Bayesian Least Squares

The general Bayes least-squares estimator (2.79) is easily derived. For simplicity and
clarity, we derive the estimator for the scalar case:

$$\hat{z} = \arg\min_{\hat{z}} \left\{ E[(\hat{z} - z)^2 \mid m] \right\} \qquad (3.42)$$

$$= \arg\min_{\hat{z}} \int (\hat{z} - z)^2 \, p(z \mid m) \, dz. \qquad (3.43)$$

To minimize this expression we set the derivative to zero:

$$\frac{\partial}{\partial \hat{z}} \int (\hat{z} - z)^2 \, p(z \mid m) \, dz = \int 2 (\hat{z} - z) \, p(z \mid m) \, dz = 0 \qquad (3.44)$$

$$\underbrace{\int \hat{z} \, p(z \mid m) \, dz}_{\hat{z}} = \underbrace{\int z \, p(z \mid m) \, dz}_{E[z \mid m]} \qquad (3.45)$$

That is, the optimal estimator is just the conditional mean of z, given the
measurements m. The result for the vector case is identical:

$$\hat{z} = E[z \mid m] \qquad (3.46)$$

This simple result is the consequence of the quadratic (least-squares) criterion, which 
simplifies when differentiated. However, the simplicity of this estimator (3.46) is 
deceiving: the conditional mean of z may be an extremely complicated, nonlinear 
function of m, motivating the development of the simpler Bayesian linear least- 
squares estimator, below. 

We conclude with three properties of the Bayesian estimator. 

• Bias: The Bayesian estimator is always unbiased:

$$b = E[e \mid m] = E[\hat{z} - z \mid m] = \hat{z} - E[z \mid m] = 0. \qquad (3.47)$$

• Orthogonality: The estimation error is orthogonal to (uncorrelated with) any
function $f(\cdot)$ of the measurements:

$$E\left[ e \cdot f(m)^T \right] = E\left[ (\hat{z} - z) \cdot f(m)^T \right] = 0. \qquad (3.48)$$

• Linearity: If the measurements m and the unknowns z are jointly Gaussian, then
the estimator $\hat{z}$ is a linear function of m.

Bayesian Linear Least Squares

Because of the complexity in evaluating the expectation (3.46) in the general Bayesian
estimator for very large estimation problems, we are motivated to find a simpler
approach. In particular, we would like a linear estimator which can be expressed in
explicit, closed form.

The Bayesian linear least-squares estimator (LLSE) is that linear estimator which, of
all possible linear estimators, minimizes the squared error. Given that the estimator
is linear, it must be of the general form

$$\hat{z} = Am + a. \qquad (3.49)$$

We derive A, a by asserting the following two criteria:

1. Unbiasedness: the estimator $\hat{z}$ is unbiased:

$$E[\hat{z} - z] = 0. \qquad (3.50)$$

2. Orthogonality: the error in $\hat{z}$ must be uncorrelated (orthogonal) with any linear
function of the data:

$$E\left[ (\hat{z} - z)(Fm + g)^T \right] = E[(\hat{z} - z)] \, E\left[ (Fm + g)^T \right] \quad \forall F, g. \qquad (3.51)$$

Asserting these two criteria, we derive the constraints on A and a:

1. Unbiasedness:

$$E[\hat{z} - z] = E[Am + a - z] = A\mu_m + a - \mu_z = 0 \qquad (3.52)$$

therefore

$$a = \mu_z - A\mu_m. \qquad (3.53)$$

2. Orthogonality: from the unbiasedness condition of (3.50)

$$E[(\hat{z} - z)] \, E\left[ (Fm + g)^T \right] = 0. \qquad (3.54)$$

We can therefore simplify the left-hand side of the orthogonality condition (3.51)
as

$$0 = E\left[ (\hat{z} - z)(Fm + g)^T \right]$$

$$= E\left[ (Am + \mu_z - A\mu_m - z)(Fm + g)^T \right]$$

$$= A \, E\left[ (m - \mu_m) m^T \right] F^T + A \, E[(m - \mu_m)] \, g^T - E\left[ (z - \mu_z) m^T \right] F^T - E[(z - \mu_z)] \, g^T \qquad (3.55)$$

$$= A \Lambda_m F^T + A \cdot 0 \cdot g^T - \Lambda_{zm} F^T - 0 \cdot g^T,$$

where $\Lambda$ denotes a covariance or cross-covariance

$$\Lambda_m = \mathrm{cov}(m) \qquad \Lambda_{zm} = E\left[ (z - \mu_z)(m - \mu_m)^T \right]. \qquad (3.56)$$

Thus it is required that

$$(A \Lambda_m - \Lambda_{zm}) F^T = 0 \quad \forall F. \qquad (3.57)$$

For this expression to be valid for all possible F, it follows that

$$A \Lambda_m - \Lambda_{zm} = 0 \;\Longrightarrow\; A = \Lambda_{zm} \Lambda_m^{-1}. \qquad (3.58)$$

The derived LLSE estimator is then the well known formula

$$\hat{z} = Am + a = \mu_z + \Lambda_{zm} \Lambda_m^{-1} (m - \mu_m). \qquad (3.59)$$

This estimator is actually quite intuitive: our estimate for z is given by the mean $\mu_z$
plus an offset due to the "surprise" in the measurements: the difference $(m - \mu_m)$
between what was observed and what was expected, normalized by the expected
random (noisy) variations $\Lambda_m$ in the measurements, and unit-converted by $\Lambda_{zm}$ from
units of m to those of z.

We can also derive the error statistics for the estimator:

$$\mathrm{cov}(\hat{z} - z) = \mathrm{cov}\left( \Lambda_{zm} \Lambda_m^{-1} (m - \mu_m) - (z - \mu_z) \right) \qquad (3.60)$$

$$= (\Lambda_{zm} \Lambda_m^{-1}) \Lambda_m (\Lambda_{zm} \Lambda_m^{-1})^T - (\Lambda_{zm} \Lambda_m^{-1}) \Lambda_{mz} - \Lambda_{zm} (\Lambda_{zm} \Lambda_m^{-1})^T + \Lambda_z \qquad (3.61)$$

Thus the error covariance $\mathrm{cov}(\tilde{z})$ is given by the prior covariance $\Lambda_z$, minus a
reduction controlled by the quality and degree of relevance of the measurements:

$$\mathrm{cov}(\tilde{z}) = \Lambda_z - \Lambda_{zm} \Lambda_m^{-1} \Lambda_{zm}^T. \qquad (3.62)$$

The estimator offers no or limited improvement,

$$\mathrm{cov}(\tilde{z}) \approx \Lambda_z, \qquad (3.63)$$

if the measurements are very noisy ($\Lambda_m$ large) or if the measurements are irrelevant
by being uncorrelated with the unknowns ($\Lambda_{zm} \approx 0$).
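The estimator and its error covariance can be checked by simulation. The following is a Monte Carlo sketch, assuming made-up second-order statistics for a two-element z and a scalar m (all numerical values, and the names `Lam_z`, `Lam_zm`, `Lam_m`, are hypothetical). It draws jointly Gaussian samples, applies the LLSE, and compares the empirical error covariance with the predicted one.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000                                  # number of Monte Carlo samples

# Hypothetical second-order statistics (chosen to be jointly valid).
mu_z = np.array([1.0, -2.0])
mu_m = np.array([0.5])
Lam_z = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Lam_zm = np.array([[1.0],
                   [0.3]])
Lam_m = np.array([[1.5]])

# Draw jointly Gaussian samples of [z; m].
mean = np.concatenate([mu_z, mu_m])
cov = np.block([[Lam_z, Lam_zm],
                [Lam_zm.T, Lam_m]])
samples = rng.multivariate_normal(mean, cov, size=N)
z, m = samples[:, :2], samples[:, 2:]

# LLSE gain A = Lam_zm Lam_m^{-1} and the estimates for every sample.
A = Lam_zm @ np.linalg.inv(Lam_m)
z_hat = mu_z + (m - mu_m) @ A.T

# Predicted error covariance versus the empirical one.
err = z_hat - z
cov_predicted = Lam_z - A @ Lam_zm.T
cov_empirical = np.cov(err, rowvar=False)

# Orthogonality: the error should be uncorrelated with the measurements.
cross = err.T @ (m - mu_m) / N

print("predicted error covariance:\n", cov_predicted)
print("empirical error covariance:\n", cov_empirical)
print("error/measurement cross-covariance:\n", cross)
```

The empirical values agree with the prediction only up to sampling error, which shrinks as N grows.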

3.2.1 Bayesian Static Problem 

The previous section derived the LLSE, but in terms of second-order statistics and 
not in terms of the parameters of the static problem. 

Given the static problem from Section 2.5.1:

$$z \sim \mathcal{N}(\mu, P) \qquad m = Cz + v \qquad v \sim \mathcal{N}(0, R), \qquad (3.64)$$

we can easily derive the relevant second-order statistics:

$$\mu_m = E[Cz + v] = C\mu_z + 0 = C\mu$$

$$\Lambda_z = P$$

$$\Lambda_{zm} = E\left[ (z - \mu_z)(m - \mu_m)^T \right] = E\left[ (z - \mu)(Cz + v - C\mu)^T \right] = PC^T$$

$$\Lambda_m = \mathrm{cov}(m - \mu_m) = \mathrm{cov}(C(z - \mu) + v) = C \Lambda_z C^T + R$$

Inserting these statistics into the LLSE (3.59), (3.62) we find Form I of the static
estimator:

$$\hat{z}(m) = \mu + PC^T (CPC^T + R)^{-1} (m - C\mu) \qquad (3.65)$$

$$\tilde{P} = \mathrm{cov}(\tilde{z}) = P - PC^T (CPC^T + R)^{-1} CP \qquad (3.66)$$

By manipulating the above equations (using the ABCD lemma, (A.39)), there exists
a second algebraically equivalent variant, Form II:

$$\hat{z}(m) = \mu + (C^T R^{-1} C + P^{-1})^{-1} C^T R^{-1} (m - C\mu) \qquad (3.67)$$

$$\tilde{P} = \mathrm{cov}(\tilde{z}) = (C^T R^{-1} C + P^{-1})^{-1} \qquad (3.68)$$

The equivalence of these two forms is by no means obvious or intuitive, and the proof 
of the ABCD lemma, connecting these two forms, requires some effort. However, a 
number of comments on these two forms are appropriate, as the suitability of one 
form or the other will vary with context: 

Form and Derivation 

The structure of Form I is clearly linked to, and directly derived from, the LLSE, 
whereas Form II is structurally quite similar to regularized least-squares (3.32). 

Model Assumptions 

Form I is explicitly expressed in terms of P and R, and the invertibility of P 
and R is not assumed, thus this estimator may be appropriate for certain singular 
estimation problems. 

On the other hand, Form II is expressed in terms of $P^{-1}$ and $R^{-1}$. We show in
Chapters 5 and 6 that in many cases a statistical model may directly specify sparse
$P^{-1}$, rather than P.

Computational Complexity 

In Form I, the matrix inversion has the dimensions of m, the measurements, 
whereas the inversion in Form II has the dimensions of z, the state. In those cases 
where the number of measurements and unknowns is significantly different, the 
choice of form may be a significant factor in determining complexity. 
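Since the equivalence of the two forms is not obvious, a numerical check can be reassuring. The sketch below assumes a randomly generated, hypothetical problem with n = 5 unknowns and k = 3 measurements, and evaluates both forms. Explicit inverses are used only because the matrices are tiny; note the different sizes of the matrices being inverted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 3                                  # unknowns, measurements (assumed)

# Hypothetical model: random positive-definite prior P, random C, simple R.
G = rng.standard_normal((n, n))
P = G @ G.T + n * np.eye(n)
C = rng.standard_normal((k, n))
R = 0.5 * np.eye(k)
mu = rng.standard_normal(n)
m = rng.standard_normal(k)

# Form I (3.65), (3.66): the inverted matrix is k x k (measurement size).
S = C @ P @ C.T + R
K1 = P @ C.T @ np.linalg.inv(S)
z_form1 = mu + K1 @ (m - C @ mu)
P_form1 = P - K1 @ C @ P

# Form II (3.67), (3.68): the inverted matrices are n x n (state size).
Rinv = np.linalg.inv(R)
P_form2 = np.linalg.inv(C.T @ Rinv @ C + np.linalg.inv(P))
z_form2 = mu + P_form2 @ C.T @ Rinv @ (m - C @ mu)

print("inverse sizes: Form I", S.shape, " Form II", P_form2.shape)
print("estimates agree:        ", np.allclose(z_form1, z_form2))
print("error covariances agree:", np.allclose(P_form1, P_form2))
```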



Example 3.2: Simple Scalar Estimation

A reader unfamiliar with Bayesian estimation may find it useful to consider a
simple scalar problem. Suppose we have the following prior and measurement
models:

$$z \sim (\mu_z, \sigma_z^2) \qquad m = cz + v \qquad v \sim (0, \sigma^2). \qquad (3.69)$$

The forward model then projects the unknown value z to measurement m.

From (3.67) we know the solution to the inverse problem to be

$$\hat{z} = \mu_z + \frac{\tilde{p} \, c}{\sigma^2} (m - c\mu_z) = \mu_z + k (m - \mu_m) \qquad (3.70)$$

where, from (3.68), the estimator gain k can be evaluated as

$$\tilde{p} = \left( \frac{c^2}{\sigma^2} + \frac{1}{\sigma_z^2} \right)^{-1} \qquad k = \frac{\tilde{p} \, c}{\sigma^2}. \qquad (3.71)$$

The estimated value $\hat{z}$ will therefore be a function of the degree to which the prior
is asserted:

Weak Prior ($\sigma_z$ large):    $\tilde{p} \approx \sigma^2 / c^2$,   $k \approx 1/c$,   $\hat{z} \approx m/c$
Strong Prior ($\sigma_z$ small):  $\tilde{p} \ll \sigma^2 / c^2$,   $k \ll 1/c$,   $\hat{z} \to \mu_z$

[Figure: the resulting estimates $\hat{z}_{\text{Strong}}$ and $\hat{z}_{\text{Weak}}$.]

3.2.2 Bayesian Estimation and Prior Means

One final comment is in order before we complete our discussion of static estimation.
All of the derived Bayesian estimators exhibit a separation between the deterministic
and stochastic aspects of an estimation problem, where the deterministic portions are
those aspects known a priori (such as the mean).

We can define a mean-removed estimation problem

$$z' = z - \mu \qquad m' = m - C\mu \qquad (3.72)$$

which leads to the simplified (zero-mean) version of (3.67)

$$\hat{z}'(m') = (C^T R^{-1} C + P^{-1})^{-1} C^T R^{-1} m'. \qquad (3.73)$$

By inspection, it is clear that we can recover the original, mean-included estimates
just by adding back the mean:

$$\hat{z} = \hat{z}' + \mu. \qquad (3.74)$$

Since the original estimator (3.67) wasn't actually that complicated, why is this mean 
removal of interest? 

1. It highlights and clarifies the important distinction between the deterministic and
stochastic portions of an estimation problem, and that these two parts can really 
be processed separately. 

2. In the dynamic estimation context of Chapter 4 the deterministic-stochastic sep- 
aration still holds, however the equations become more complicated than in the 
static case, and so the simplification offered by a zero-mean assumption may be 
significant. 

3. In many problems of practical interest, the definition of "the mean" may be very 
ambiguous. In computing estimates of ocean temperature, per Example 3.3, is 
"the mean" an average over one day, over one year, over one century? In principle 
the ocean temperature is ever-changing, and probably obeys no mean. In practice, 
however, we are normally interested in producing estimates over a certain time- 
scale, so by removing the behaviours which vary over much longer time-scales 
we relieve ourselves of the burden of modelling those behaviours, leaving us with 
a simpler problem. 

We therefore adopt a relaxed policy with regard to knowing or specifying the mean 
in derivations and examples, with the understanding that a mean term can always be 
added to the estimates as a separate step. 



Example 3.3: Prior Mean Removal

Suppose we wish to produce a map of ocean surface temperature.

We start, for example, with the three days of ATSR data shown here. The temperature
is difficult to model, because the observed temperature is based on the superposition
of at least two effects:

1. Short-term patterns, such as storms and cold fronts.

2. Long-term patterns, for example that the tropics tend to be warm, with the
oceans cooling towards the poles.

If we wish to study short-term temperature fluctuations, then the long-term variations
are a nuisance: they are an additional phenomenon to model, and furthermore the
large temperature range between poles and tropics means that subtle variations may
be masked.

Based on an entire year of temperature data, we can produce an approximate
sea-surface temperature mean, as shown here. The image represents phenomena
unchanging over the period of a year: the warm Gulf of Mexico, the cool water of
the South Equatorial Current etc.

Returning to the three days of ATSR data from the top figure, if we subtract out
the estimated mean, then we arrive at a very different dataset, as shown. The dynamic
range of the data is much smaller (4 degrees here, 25 degrees above), and subtle
warming and cooling variations can be seen, both in the tropics and elsewhere. The
statistical behaviour of these mean-removed data (3.72) is much simpler than before;
at this point we learn a model for $(z - \mu)$ and proceed.

3.2.3 Approximate Bayesian Estimators

As we are interested, throughout this book, in computationally efficient, approximate
estimators, it is worth considering the effects of an approximation.

Suppose that we have a zero-mean random vector z; then from (3.65)-(3.68) we
know the forms of the optimum linear least-squares estimator.

For any of a number of reasons, the matrices P, R may be difficult to invert or store 
in memory, and so we may wish to consider solving a very similar problem which 
yields similar (i.e., approximated) results: 

P~P, R~R => 1~|, P~P. (3.75) 

Whether the approximation is good or not is a difficult question, to be discussed in 
later chapters. However, in assessing an approximation it is important that we clearly 
distinguish three different error covariances: 

P = cov(i — z) The error covariance of the optimum estimator. 
P ~ cov(J. — z) The error covariance computed by the approximate estimator. 
Pz = cov(f — z) The actual error covariance of the approximate estimator. 

The optimality of the original estimator clearly requires that 

P < Pz, (3.76) 

however the relationship between P and P,Pz is much less clear. For example, the 
particular choices of P, R might lead to a naively optimistic estimator, P < P, or 
an unnecessarily pessimistic one, P > P%. 

That is, in assessing the approximate estimator it is crucial to distinguish between 
what it thinks its covariance is (P), and its actual covariance (Pz). 

Suppose we are given the standard static estimation problem 

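$$z \sim \mathcal{N}(0, P), \qquad m = Cz + v, \qquad v \sim \mathcal{N}(0, R) \qquad (3.77)$$

and we select an approximate estimator

$$\hat{\bar{z}} = \bar{K} m \approx K m. \qquad (3.78)$$

Then the actual error covariance of this estimator is

$$\begin{aligned}
\tilde{P}_{\bar{z}} = \operatorname{cov}(\hat{\bar{z}} - z) &= \operatorname{cov}(\bar{K} m - z) && (3.79) \\
&= \operatorname{cov}(\bar{K} C z + \bar{K} v - z) && (3.80) \\
&= \operatorname{cov}\big((\bar{K} C - I) z + \bar{K} v\big) && (3.81) \\
&= (\bar{K} C - I) P (\bar{K} C - I)^T + \bar{K} R \bar{K}^T. && (3.82)
\end{aligned}$$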

Although it looks a bit cumbersome, this latter expression (3.82) is an important 
result, allowing us to find the actual error covariance for any linear estimator. 
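As a rough numerical illustration of (3.82), the following sketch compares the three error covariances for a small problem. The problem size, the exponential prior, and the names `gain`, `actual_error_cov`, `P_bar`, and `R_bar` are assumptions made only for this example; the approximate model simply ignores the prior correlations.

```python
import numpy as np

def gain(P, R, C):
    """Linear least-squares gain K = P C^T (C P C^T + R)^{-1}."""
    return P @ C.T @ np.linalg.inv(C @ P @ C.T + R)

def actual_error_cov(K, P, R, C):
    """Error covariance (3.82) of the linear estimator K m under the model (P, R)."""
    I = np.eye(P.shape[0])
    return (K @ C - I) @ P @ (K @ C - I).T + K @ R @ K.T

n = 5
idx = np.arange(n)
P = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # assumed true prior covariance
R = 0.5 * np.eye(2)                               # assumed true noise covariance
C = np.zeros((2, n))
C[0, 0] = 1.0                                     # observe the first element
C[1, 3] = 1.0                                     # observe the fourth element

P_bar = np.eye(n)      # hypothetical approximation: prior correlations ignored
R_bar = R.copy()

K = gain(P, R, C)                  # optimum gain
K_bar = gain(P_bar, R_bar, C)      # approximate gain

P_opt = actual_error_cov(K, P, R, C)                   # optimum error covariance
P_claimed = actual_error_cov(K_bar, P_bar, R_bar, C)   # what the approximate estimator believes
P_actual = actual_error_cov(K_bar, P, R, C)            # what it really achieves

print("trace optimum :", np.trace(P_opt))
print("trace claimed :", np.trace(P_claimed))
print("trace actual  :", np.trace(P_actual))
```

Evaluating (3.82) with the approximate gain but the true statistics gives the actual covariance; evaluating it with the approximate gain and the approximate statistics gives the covariance that the approximate estimator reports.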






3.2.4 Bayesian / NonBayesian Duality 

Philosophically, the Bayesian and non-Bayesian perspectives lie at opposite ends of 
a spectrum, and there has been vigorous debate as to whether prior knowledge is of 
utmost importance, on the one hand, or statistically irrelevant, on the other. 

It is rather surprising, then, that a comparison of the Bayesian estimator (3.67) and 
the non-Bayesian least-squares estimators (3.26), (3.32) reveals an astonishing degree
of similarity, despite very different formulations and optimization objectives: 

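Weighted Least-Squares:

Minimize $\|m - Cz\|_{W}$

$$\Longrightarrow \quad \hat{z}(m) = (C^T W C)^{-1} C^T W m$$

Regularized Least-Squares:

Minimize $\|m - Cz\|_{R^{-1}} + \lambda \|Lz\|$

$$\Longrightarrow \quad \hat{z}(m) = (\lambda L^T L + C^T R^{-1} C)^{-1} C^T R^{-1} m$$

Bayesian Linear Least-Squares:

Minimize $E\left[(z - \hat{z})^T (z - \hat{z})\right]$

$$\Longrightarrow \quad \hat{z}(m) = \mu + (P^{-1} + C^T R^{-1} C)^{-1} C^T R^{-1} (m - C\mu)$$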

The algebraic connections among these estimators can be made more explicit by 
rewriting the latter Bayesian estimator as 



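$$\begin{aligned}
\hat{z}(m) &= \mu + \left(P^{-1} + C^T R^{-1} C\right)^{-1} C^T R^{-1} (m - C\mu) \\
&= \left(P^{-1} + C^T R^{-1} C\right)^{-1} \left(C^T R^{-1} m + P^{-1} \mu\right) && (3.83) \\
&= \left( \begin{bmatrix} C \\ I \end{bmatrix}^T \begin{bmatrix} R^{-1} & 0 \\ 0 & P^{-1} \end{bmatrix} \begin{bmatrix} C \\ I \end{bmatrix} \right)^{-1} \begin{bmatrix} C \\ I \end{bmatrix}^T \begin{bmatrix} R^{-1} & 0 \\ 0 & P^{-1} \end{bmatrix} \begin{bmatrix} m \\ \mu \end{bmatrix} && (3.84)
\end{aligned}$$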



in which case, viewed through the context of weighted least-squares (3.26), the 
Bayesian estimator of (3.84) is equivalent to 



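$$\text{minimize} \quad \left\| \begin{bmatrix} m \\ \mu \end{bmatrix} - \begin{bmatrix} C \\ I \end{bmatrix} z \right\|_{W}, \qquad W = \begin{bmatrix} R^{-1} & 0 \\ 0 & P^{-1} \end{bmatrix}. \qquad (3.85)$$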



That is, from an algebraic point of view, prior knowledge is equivalent to a measurement:
both are pieces of statistical information which guide the estimator. A prior
$z \sim \mathcal{N}(\mu, P)$ may thus be interpreted as saying that we have a "measurement" $\mu$ of
$z$, with the uncertainty in the measurement determined by $P$.

This duality also leads to a revised, dual criterion for the Bayesian estimator. Our 
original Bayesian criterion from (3.40), 
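$$\text{Minimize} \quad E\left[(z - \hat{z})^T (z - \hat{z})\right], \qquad (3.86)$$

is reinterpreted in the context of weighted least-squares, using the dual formulation
(3.84), as the minimization of

$$\begin{aligned}
& \left( \begin{bmatrix} m \\ \mu \end{bmatrix} - \begin{bmatrix} C \\ I \end{bmatrix} \hat{z} \right)^T \begin{bmatrix} R^{-1} & 0 \\ 0 & P^{-1} \end{bmatrix} \left( \begin{bmatrix} m \\ \mu \end{bmatrix} - \begin{bmatrix} C \\ I \end{bmatrix} \hat{z} \right) && (3.87) \\
&\quad = (m - C\hat{z})^T R^{-1} (m - C\hat{z}) + (\mu - \hat{z})^T P^{-1} (\mu - \hat{z}) && (3.88) \\
&\quad = \|m - C\hat{z}\|_{R^{-1}} + \|\mu - \hat{z}\|_{P^{-1}}. && (3.89)
\end{aligned}$$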



That is, the Bayesian estimator really is analogous to a standard least-squares
approach, except that it simultaneously tries to minimize both the measurement and
prior inconsistencies, weighted by their respective confidences $R^{-1}, P^{-1}$.
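The duality can be checked numerically. The sketch below, with arbitrary randomly generated matrices standing in for $C$, $P$, $R$ (all values are assumptions for illustration), computes the Bayesian estimate directly and again as a weighted least-squares problem in which the prior mean is stacked underneath the measurements as in (3.84).

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 4, 3                                   # state and measurement sizes (arbitrary)
C = rng.standard_normal((q, n))
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)                   # a valid (positive-definite) prior covariance
R = np.diag([0.5, 1.0, 2.0])                  # measurement noise covariance
mu = rng.standard_normal(n)                   # prior mean
m = rng.standard_normal(q)                    # measurements

Pinv, Rinv = np.linalg.inv(P), np.linalg.inv(R)

# Bayesian linear least-squares estimate
z_bayes = mu + np.linalg.solve(Pinv + C.T @ Rinv @ C, C.T @ Rinv @ (m - C @ mu))

# Weighted least squares with the prior treated as an extra "measurement" of z
C_aug = np.vstack([C, np.eye(n)])
m_aug = np.concatenate([m, mu])
W = np.block([[Rinv, np.zeros((q, n))],
              [np.zeros((n, q)), Pinv]])
z_wls = np.linalg.solve(C_aug.T @ W @ C_aug, C_aug.T @ W @ m_aug)

print(np.max(np.abs(z_bayes - z_wls)))        # difference at round-off level
```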



3.3 Static Sampling 



The notion of sampling was introduced in Sections 2.5.2 and 2.5.4, such that we wish 
to generate random samples of a random vector. 

If the prior statistics of a vector are known, we wish to generate prior samples from 

$$z \sim (\mu, P). \qquad (3.90)$$

Similarly, if additional constraints beyond the prior model (such as from measure- 
ments) are asserted, then random samples may be generated from the constrained 
vector 

$$z \,|\, m \sim (\hat{z}, \tilde{P}), \qquad (3.91)$$

where $\hat{z}, \tilde{P}$ are calculated via the Bayesian estimator (3.65)-(3.68).

As we saw in (2.76), given the mean and covariance of a random vector, random
samples can be found by computing a matrix square root (Appendix A.8). Given
$w \sim I$ and a square root $\Gamma$ satisfying $\Gamma^T \Gamma = P$,

$$\operatorname{cov}(\Gamma^T w) = E\left[\Gamma^T w w^T \Gamma\right] = \Gamma^T I \, \Gamma = P. \qquad (3.92)$$

Thus

$$\mu + P^{1/2} w \sim (\mu, P) \qquad (3.93)$$

$$\hat{z} + \tilde{P}^{1/2} w \sim (\hat{z}, \tilde{P}) \qquad (3.94)$$

are random prior and posterior samples, respectively. However, compared to computing
estimates $\hat{z}$, finding prior or posterior samples is much more difficult, due to
the $O(n^3)$ calculation of the matrix square root, in addition to the $O(n^3)$ calculation
of the posterior covariance $\tilde{P}$ in the case of posterior sampling.
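A minimal sketch of prior and posterior sampling follows, assuming an exponential prior covariance, two point measurements, and a Cholesky factor as the matrix square root (so that $\Gamma^T$ is the lower-triangular factor returned by `numpy.linalg.cholesky`). Here $w \sim I$ is taken to mean zero-mean, unit-covariance white noise. The sizes, measurement locations, and the helper `sample` are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
idx = np.arange(n)
P = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 10.0)   # exponential autocorrelation (assumed)
mu = np.zeros(n)

C = np.zeros((2, n))
C[0, 10] = 1.0                                 # measure element 10
C[1, 35] = 1.0                                 # measure element 35
R = 0.01 * np.eye(2)
m = np.array([1.0, -1.0])

# Bayesian estimate and posterior covariance (Form I)
K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
z_hat = mu + K @ (m - C @ mu)
P_post = P - K @ C @ P
P_post = 0.5 * (P_post + P_post.T)             # remove round-off asymmetry

def sample(mean, cov, count, rng):
    """Draw `count` samples mean + L w, where L L^T = cov and w is unit white noise."""
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(len(mean)))  # small jitter for safety
    w = rng.standard_normal((len(mean), count))
    return mean[:, None] + L @ w

prior_samples = sample(mu, P, 2000, rng)       # columns are samples of z
post_samples = sample(z_hat, P_post, 2000, rng)  # columns are samples of z given m

print("prior variance at element 10    :", prior_samples[10].var())
print("posterior variance at element 10:", post_samples[10].var())
```

The posterior samples are tightly pinned at the two measured elements and spread out away from them, which is the behaviour described in Example 3.4.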
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Example 3.4: Static Estimation and Sampling 

Following up on Example 2.9, suppose we consider an ensemble of random prior 
samples for a first-order problem (left) or second-order problem (right): 





A first-order problem corresponds to an exponential spatial autocorrelation, and 
the second-order problem very nearly to a Gaussian. Given two measurements, 
we can clearly observe the posterior samples to be constrained at the measured 
point, but have greater freedom away from measurements: 




Each panel shows the estimate (solid) and one standard-deviation envelope 
(dashed). It is clear that there is a relationship between the spread of posterior 
samples and the estimation error variance at any point. Clearly additional mea- 
surements more tightly constrain the problem: 





Note how the second-order correlation leads to a very smooth interpolation (but 
also poorly conditioned), whereas the exponential (first-order) case leads to piece- 
wise linear estimates. 



An estimation problem which is densely sampled with low-variance measurements
leads to $\tilde{P}$ close to zero, in which case the random posterior samples are nearly
identical to the estimates $\hat{z}$. Conversely, if there are no measurements, then the prior and
posterior samples both come from the same distribution, since without measurements

$$\hat{z} = \mu, \qquad \tilde{P} = P. \qquad (3.95)$$



If a non-Bayesian static estimation problem is given, as in Section 3.1, then a prior 
model may not exist if the estimation problem is ill-posed in the absence of mea- 
surements. In such cases a formal statistical prior sample does not exist, although in 
Section 11.2 we show that typical samples can still be generated.



3.4 Data Fusion 



From the discussion in Section 2.1 on data fusion, we wanted to allow the possibility 
of multiple measurement sets, rather than just a single measurement vector. Such 
problems occur frequently in remote sensing, for example, as shown in Figure 3.2, 
where the same scene is imaged by multiple instruments, possibly at different spatial 
or temporal resolutions; similar examples occur in computer vision, such as a stereo 
pair of cameras, and throughout medical imaging, for example combining data from 
two or more of ultrasound, PET, MRI, and CT images. 

In many cases a data fusion problem can be rewritten as a more complicated regular 
static problem. That is, data fusion problems are not inherently distinct from other 
estimation or sampling problems; therefore, with the exception of this section, we do 
not discuss data fusion separately. 

Recall from (2.8) in Section 2.1 the basic premise: we are given multiple measure- 
ments 



$$m_1 = C_1 z + v_1, \qquad v_1 \sim \mathcal{N}(0, R_1) \qquad (3.96)$$

$$m_2 = C_2 z + v_2, \qquad v_2 \sim \mathcal{N}(0, R_2) \qquad (3.97)$$



It is possible to stack these multiple measurement models into a single one: 



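$$\begin{bmatrix} m_1 \\ m_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \end{bmatrix} z + \begin{bmatrix} v_1 \\ v_2 \\ \vdots \end{bmatrix}, \qquad v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} R_1 & & \\ & R_2 & \\ & & \ddots \end{bmatrix} \right) \qquad (3.98)$$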



which is a basic static problem, where the block-diagonal nature of the measurement
noise follows from the assumption that the various measurement noise terms
$v_1, v_2, \ldots$ are uncorrelated. It is important to note, however, that the conversion back

Fig. 3.2. An example of Bayesian estimation and data fusion [109]: A static Bayesian estimator 
(based on Section 10.3) produced these estimates of sea-surface height in the Mediterranean 
Sea. The estimates are based on fusing the altimetric (height) data from two complementary 
satellites: Topex-Poseidon [120], which samples more sparsely in space but more frequently 
in time, and ERS-1 [203], which samples more densely in space but less frequently in time. 



to the static form is not dependent on noise decorrelation, or even on the linearity of 
the measurement model. Given nonlinear measurements having correlated noise, 



$$m_1 = C_1(z) + v_1, \qquad v_1 \sim \mathcal{N}(0, R_{11}) \qquad (3.99)$$

$$m_2 = C_2(z) + v_2, \qquad v_2 \sim \mathcal{N}(0, R_{22}) \qquad (3.100)$$



the problem transforms as in (3.98): 



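$$\begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} C_1(z) \\ C_2(z) \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \qquad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix} \right) \qquad (3.101)$$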



A more interesting case is one in which the measurements arrive separately. Suppose 
we first receive a set of measurements, from which we compute estimates: 



Initial prior: $z \sim (0, P)$

First measurement: $m_1 = C_1 z + v_1$

$$\Longrightarrow \quad z \,|\, m_1 \sim (\hat{z}_1, \tilde{P}_1). \qquad (3.102)$$



Then, suppose that afterward a second set of measurements becomes available. We 
know that we could combine the first and second set together, as in (3.98), however 
that requires that we hold on to the previous measurements $m_1$ indefinitely, and also
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it throws away the work that we did in (3.102) to compute the estimates. Really, we 
want to fuse the new measurements into the existing estimates. Actually this is not 
so hard, because we can view the posterior $(\hat{z}_1, \tilde{P}_1)$ from the first stage as the prior
to the second: 



Current "prior": $z \,|\, m_1 \sim (\hat{z}_1, \tilde{P}_1)$

Second measurement: $m_2 = C_2 z + v_2$

$$\Longrightarrow \quad z \,|\, m_1, m_2 \sim (\hat{z}_{12}, \tilde{P}_{12}). \qquad (3.103)$$



This latter step is just another, regular static problem, where we can see that the 
calculation of the error statistics $\tilde{P}_1$ in the former step was essential to fuse the later
measurements. 
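The equivalence between stacking the measurements and fusing them one set at a time can be sketched as follows. The dimensions and random matrices are arbitrary assumptions, `update` is a hypothetical helper implementing one static Bayesian update, and the two noise terms are taken to be uncorrelated with each other.

```python
import numpy as np

def update(mu, P, C, R, m):
    """One static Bayesian update: prior (mu, P) and measurement m = C z + v, cov(v) = R."""
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    return mu + K @ (m - C @ mu), P - K @ C @ P

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
P = A @ A.T + np.eye(n)                        # prior covariance
mu = np.zeros(n)                               # prior mean
C1 = rng.standard_normal((2, n)); R1 = 0.3 * np.eye(2); m1 = rng.standard_normal(2)
C2 = rng.standard_normal((3, n)); R2 = 0.7 * np.eye(3); m2 = rng.standard_normal(3)

# Batch: stack both measurement sets into one static problem
C = np.vstack([C1, C2])
m = np.concatenate([m1, m2])
R = np.block([[R1, np.zeros((2, 3))],
              [np.zeros((3, 2)), R2]])
z_batch, P_batch = update(mu, P, C, R, m)

# Sequential: the posterior after m1 becomes the prior for m2
z1, P1 = update(mu, P, C1, R1, m1)
z12, P12 = update(z1, P1, C2, R2, m2)

print(np.max(np.abs(z_batch - z12)), np.max(np.abs(P_batch - P12)))
```

Note that the second update needs both the estimate and its error covariance from the first stage; discarding the covariance would make the fusion impossible.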

The above, sequential process in (3.102), (3.103) is something like a dynamic prob- 
lem, in that measurements are arriving one after another, however the underlying 
state z is not changing, so there are actually no dynamics present. Nevertheless, this 
example has much in common with dynamic estimation, and the above idea leads 
particularly nicely into the dynamic Kalman filter, derived in Chapter 4. 



Application 3: Atmospheric Temperature Inversion [282] 

There are a number of scientific challenges which follow the generic inverse problem 
very closely: 

Underlying State → Measurements → Estimated State

One such example is the temperature inverse-problem of the atmosphere [282]. Al- 
though temperatures can be measured directly at one location using a weather bal- 
loon, to produce a map of temperature with large-scale coverage it would be prefer- 
able to use a satellite or high-flying aircraft to infer the temperature remotely. 

Underlying State: 

We wish to estimate the atmospheric temperature $t(a)$ as a function of altitude $a$.²

Forward Problem: 

It is well known that every molecule (such as oxygen O$_2$) is associated with a
line (emission / absorption) spectrum. The spectral line is not infinitesimally thin,
however, rather it is a smeared line whose width is a function of pressure (the Stark
effect), as sketched in Figure 3.3(a). Furthermore, the strength with which oxygen
radiates energy is a function of its temperature. Therefore a satellite, far above
the atmosphere, observing the microwave energy at 116 GHz will be able to infer



2 Technically the temperature is normally expressed as a function of pressure, however ig- 
noring spatial variations in atmospheric pressure, there is a one-to-one relationship between 
pressure and altitude. 



[Figure 3.3 panels: (a) Spectral Line Width vs. Altitude (frequency axis 116 to 122 GHz, altitude axis to 20 km); (b) Altitude Weighting Functions for Different Frequencies (weight versus altitude).]



Fig. 3.3. The forward problem for atmospheric temperature inversion. The width of the oxygen 
118 GHz spectral line, left, varies with pressure (or, equivalently, altitude). By solving the 
process by which microwave radiation moves through the atmosphere, right, we obtain a set 
of curves, each describing how much a given frequency is influenced by temperature at a given 
altitude. 



ground-level temperature, since at all altitudes above the ground the atmosphere
is transparent to 116 GHz.

The problem is a bit more complicated, however. The spectral power around
118 GHz, closer to the spectral line, is not just the average of atmospheric temperature
from 0 km to 15 km. Instead, a microwave photon emitted near the ground
may be absorbed, and re-emitted, by another molecule higher in the atmosphere
(the so-called radiative transfer problem). If this process is simulated, we can find
the weighting functions $w(a, f)$ at altitude $a$ and frequency $f$, such that the spectral
strength observed from space is

$$s(f) = \int t(a) \, w(a, f) \, da. \qquad (3.104)$$



Normally our satellite would observe the spectrum at only a discrete number of 
frequencies, as suggested by the sketch in Figure 3.3(b). 

Inverse Problem: 

We suppose that we wish to estimate temperatures at discrete altitudes $a_1, \ldots, a_n$.
Furthermore we suppose that the satellite observes the spectrum at frequencies
$f_1, \ldots, f_q$. Our state and measurement are therefore defined,

$$z = \begin{bmatrix} t(a_1) \\ \vdots \\ t(a_n) \end{bmatrix}, \qquad m = \begin{bmatrix} s(f_1) \\ \vdots \\ s(f_q) \end{bmatrix}, \qquad (3.105)$$



leading to a forward problem 
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m = Wz + v. (3.106) 

W follows from the weights in Figure 3.3(b) and v is a noise term which will 
be a function of the instrument design. Normally this problem will be undercon- 
strained, so we want to regularize z, most likely by asserting smoothness. 

Because this inverse problem is one-dimensional (a single look down through the 
atmosphere), the number of unknowns is modest and the problem can be solved 
directly by matrix inversion. Analytical methods and neural networks have also 
been used. 
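To make the structure of the inverse problem concrete, here is a purely illustrative sketch. The Gaussian-shaped weighting functions, the lapse-rate temperature profile, the noise level, and the value of the regularization weight `lam` are all invented for the example and are not taken from any real instrument; the estimate uses the regularized least-squares form with a second-order smoothness constraint.

```python
import numpy as np

n, q = 40, 6
a = np.linspace(0.0, 20.0, n)                  # discrete altitudes (km)
peaks = np.linspace(2.0, 16.0, q)              # hypothetical peak altitude of each channel
W = np.exp(-0.5 * ((a[None, :] - peaks[:, None]) / 3.0) ** 2)
W /= W.sum(axis=1, keepdims=True)              # each row is a normalized weighting function

t_true = 288.0 - 6.5 * np.minimum(a, 11.0)     # illustrative temperature profile (K)
rng = np.random.default_rng(3)
sigma = 0.1                                    # assumed instrument noise (K)
m = W @ t_true + sigma * rng.standard_normal(q)   # forward problem m = W z + v

# Second-difference (curvature) operator used to assert smoothness
L = np.zeros((n - 2, n))
for i in range(n - 2):
    L[i, i:i + 3] = [-1.0, 2.0, -1.0]

lam = 10.0                                     # regularization weight (arbitrary choice)
Rinv = np.eye(q) / sigma ** 2
t_hat = np.linalg.solve(W.T @ Rinv @ W + lam * (L.T @ L), W.T @ Rinv @ m)

print("worst measurement residual (K):", np.max(np.abs(W @ t_hat - m)))
```

With only six measurements and forty unknowns the problem is heavily underconstrained, so the smoothness term, rather than the data, determines most of the detail in the estimated profile.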



Summary 

Given a least-squares problem $m = Cz$, our objective is to minimize the weighted
squared error

$$\hat{z} = \arg\min_z \left\{ (m - Cz)^T W (m - Cz) \right\} \qquad (3.107)$$

for which the optimum solution is given by

$$\hat{z} = (C^T W C)^{-1} C^T W m. \qquad (3.108)$$

Given a constrained, regularized problem our objective is to minimize the weighted
squared error

$$\min_z \left\{ \|m - Cz\|_{R^{-1}} + \lambda \|Lz\| \right\} \qquad (3.109)$$

for which the optimum solution is given by

$$\hat{z} = (C^T R^{-1} C + \lambda L^T L)^{-1} C^T R^{-1} m. \qquad (3.110)$$

Given a static Bayesian estimation problem

$$z \sim (\mu, P), \qquad m = Cz + v, \qquad v \sim (0, R), \qquad (3.111)$$

the optimum estimator can be formulated in the following two algebraically-equivalent
ways:

Form I:

$$\hat{z}(m) = \mu + (P C^T)(C P C^T + R)^{-1} (m - C\mu) \qquad (3.112)$$

$$\tilde{P} = \operatorname{cov}(\hat{z} - z) = P - (P C^T)(C P C^T + R)^{-1} C P \qquad (3.113)$$

Form II:

$$\hat{z}(m) = \mu + (P^{-1} + C^T R^{-1} C)^{-1} C^T R^{-1} (m - C\mu) \qquad (3.114)$$

$$\tilde{P} = \operatorname{cov}(\hat{z} - z) = (P^{-1} + C^T R^{-1} C)^{-1} \qquad (3.115)$$
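A quick numerical check of the equivalence of the two forms, using arbitrary random matrices (the sizes and values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 5, 3                                    # state and measurement sizes (arbitrary)
A = rng.standard_normal((n, n))
P = A @ A.T + np.eye(n)                        # prior covariance
R = np.diag(rng.uniform(0.5, 2.0, q))          # measurement noise covariance
C = rng.standard_normal((q, n))
mu = rng.standard_normal(n)
m = rng.standard_normal(q)

# Form I: inverts a q x q matrix (size of the measurement vector)
G = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
z_form1 = mu + G @ (m - C @ mu)
P_form1 = P - G @ C @ P

# Form II: inverts n x n matrices (size of the state vector)
Pinv, Rinv = np.linalg.inv(P), np.linalg.inv(R)
P_form2 = np.linalg.inv(Pinv + C.T @ Rinv @ C)
z_form2 = mu + P_form2 @ C.T @ Rinv @ (m - C @ mu)

print(np.max(np.abs(z_form1 - z_form2)), np.max(np.abs(P_form1 - P_form2)))
```

Which form is cheaper depends on the relative sizes of the state and the measurement vector, and on which of $P$ or $P^{-1}$ is available.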




For Further Study 



There are a great many textbooks that cover estimation theory; as an accessible intro- 
duction, the reader is referred to [248, 284]. For a comprehensive coverage of spatial 
statistics, the book by Cressie [75] is an excellent reference. 



Sample Problems 

Problem 3.1: Static Estimation Forms 

Use the ABCD identity from Appendix A to prove the equivalence of the two
static estimator forms (3.112)-(3.113) and (3.114)-(3.115).

Problem 3.2: Linear Regression 

In Example 3.1 on page 62, the answer to the estimate z is written down, but 
without derivation. Complete the example by deriving (3.20), the solution to a, b. 

Problem 3.3: Static Sampling 

Suppose you are given a vector $\mu$ and a covariance $P$. Prove that

$(\mu + P^{1/2} w)$ is distributed as $(\mu, P)$,
where $w \sim I$.

Problem 3.4: Covariance Reduction 

A measurement should never increase our uncertainty in an estimate. At worst a 
measurement is useless and should have no effect, in all other cases the estimation 
error covariance should decrease from the prior. Prove that this is so. 

That is, for Bayesian least squares prove that $\tilde{P} \le P$, where the matrix inequality
is understood in positive-semidefinite terms; in other words, that $P - \tilde{P}$ must be
positive-semidefinite. The proof is easiest beginning with Form I (3.113).

Problem 3.5: First-Order Interpolation 

Let z be a vector having 100 elements. We wish to do regularized, constrained 
estimation as in (3.109), (3.110). 
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We assert a first-order (gradient) constraint in L, penalizing the differences be- 
tween adjacent state elements. Thus L is a 99 x 100 matrix such that 

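$$l_{i,j} = \begin{cases} -1 & \text{if } j = i \\ \phantom{-}1 & \text{if } j = i + 1 \\ \phantom{-}0 & \text{otherwise} \end{cases}$$

meaning that each row of $L$ looks like

$$\begin{bmatrix} 0 & \cdots & 0 & -1 & 1 & 0 & \cdots & 0 \end{bmatrix}.$$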



(a) How many measurements are required for the constrained problem to satisfy 
uniqueness? Why? 

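(b) Suppose we observe elements $z_{25}$ and $z_{75}$, thus

$$m = Cz + v = \begin{bmatrix} z_{25} \\ z_{75} \end{bmatrix} + v \qquad \text{where } m = \ldots, \quad v \sim I.$$

Compute and plot $\hat{z}$ for $\lambda = 1, 5, 25$.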

Problem 3.6: Second-Order Interpolation 

Let z be a vector having 100 elements. We wish to do regularized, constrained 
estimation as in (3.109), (3.110). 

Following on Problem 3.5, we assert a second-order (curvature) constraint in L, 
penalizing the curvature (rate of change of gradient) across three successive state 
elements. Thus L is a 98 x 100 matrix such that 

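$$l_{i,j} = \begin{cases} -1 & \text{if } j = i \text{ or } j = i + 2 \\ \phantom{-}2 & \text{if } j = i + 1 \\ \phantom{-}0 & \text{otherwise} \end{cases}$$

so that each row of $L$ looks like

$$\begin{bmatrix} 0 & \cdots & 0 & -1 & 2 & -1 & 0 & \cdots & 0 \end{bmatrix}.$$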



(a) How many measurements are required for the constrained problem to satisfy 
uniqueness? Why? 

(b) Suppose we observe elements $z_1$, $z_{50}$, and $z_{100}$ with

$$m^T = \begin{bmatrix} 0 & 50 & 25 \end{bmatrix}, \qquad v \sim I.$$

Compute and plot $\hat{z}$ for $\lambda = 250, 2000, 10000$.
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(c) Comment on the differences in the estimates produced by first- and second- 
order constraints. 

Problem 3.7: Cross-Validation

We wish to use the method of cross-validation (2.56) on Page 37 from Chapter 2 
to infer the optimal regularization constraint $\lambda$ in Problems 3.5 and 3.6. A part
of this problem was solved in Example 2.8; the reader may wish to refer back to 
this example. 

Suppose we have three measurement sets: 

(!) c i (!) 

(2) i , (2) 



-! 3) =(^f+4 3) 



(i) 

where 1 < i < 100 and where v\ ' is a white noise process of variance 2.0. 

For each of the measurement sets, and for each of the first- and second-order 
constraints, perform cross-validation. 

Plot the measurement-estimate inconsistency $e^2(\lambda)$ from (2.56), find the optimal
value of $\lambda$, and plot the estimates corresponding to $\lambda/100$, $\lambda$, and $100\lambda$.

Comment on the behaviour of the $e^2(\lambda)$ and estimation plots.

Problem 3.8: Data Fusion 

Let's investigate doing data fusion. Suppose z is a vector of 100 unknowns, where 
we observe elements $z_{10}$, $z_{30}$, $z_{50}$, $z_{70}$, and $z_{90}$ with

m T = [0 20 30 0] v~ I. 

The data fusion problem needs a prior; we use the second-order constraint of
Problem 3.6, such that the prior $P^{-1} = \lambda L^T L$, where $\lambda = 2000$. We'll compute
estimates in one of two ways:

(a) Using usual constrained least squares, compute the estimates $\hat{z}(m)$ and estimation
error covariance $\tilde{P}$. Plot the five measurements, superimposed on the
estimates, plus-minus one standard deviation

$$\hat{z} \pm \sqrt{\operatorname{diag}(\tilde{P})}.$$

(b) Now we want to do data fusion, whereby we will incorporate only one mea- 
surement at a time, using the method outlined in (3.102), (3.103): 

1. Compute $\hat{z}_2, \tilde{P}_2$ from measurements $m_1, m_2$, using prior $P$ as before.

2. Compute $\hat{z}_3, \tilde{P}_3$ by fusing measurement $m_3$ into $\hat{z}_2, \tilde{P}_2$.

3. Compute $\hat{z}_4, \tilde{P}_4$ by fusing $m_4$ into $\hat{z}_3, \tilde{P}_3$.

4. Compute $\hat{z}_5, \tilde{P}_5$ by fusing $m_5$ into $\hat{z}_4, \tilde{P}_4$.

Plot

$$\hat{z}_i \pm \sqrt{\operatorname{diag}(\tilde{P}_i)},$$

superimposed on the relevant measurement, for each iteration $i = 2, \ldots, 5$.

State your observations, comparing the results from (a) and (b). 



Dynamic Estimation and Sampling 



Chapter 3 developed estimators for static problems — those in which z has no time 
dependence. Given measurements, we compute a corresponding set of estimates, and 
the solution is complete. 

Clearly the problem is much more interesting if z is permitted to evolve and be mea- 
sured over time. Indeed, this is a classic problem in the control literature, in which 
we wish to control some aspects (such as temperature, velocity, pressure, height, vol- 
ume, etc.) of a time-evolving system z. A common step in developing a controller is 
to develop an "observer," to estimate $z$, the unknown current state of the system. The
Kalman filter [182] is a dynamic, recursive estimator which was first developed as 
an observer for system control, but which has found fantastically broad application 
in a wide variety of fields, including statistical image processing. 

Although temporal image processing is certainly an area of interest, particularly for 
video sequences, it is important to realize that the "time" t may be interpreted much 
more abstractly: any signal which varies as a function of a discrete variable is a candi- 
date for Kalman filtering. Thus, in the context of multidimensional signal processing, 
we may choose t to index over time, over video frames, over the rows or columns of 
an image, over the planes in a 3D volume, over the resolutions in a multiresolution 
tree, or the raster scan of a 2D image, to name only a few possibilities. Three real, 
practical examples are illustrated in Figure 4.1. 

To be sure, derivations of the Kalman filter may be found in many texts [8, 97, 151, 
156, 231, 284]. We repeat one here, partly for reasons of continuity in development, 
but also because many other derivations are unnecessarily complicated, and may 
include continuous-time derivations, which are important in control, but have little 
to contribute to the inherently discrete world of pixellated images. 

Our discussion begins with a consideration of first-order Gauss-Markov models, and 
of the relationship between static and dynamic estimation problems, followed by a 
derivation of the discrete-time Kalman filter. 
[Figure 4.1 panels]

Top: Bands of Satellite Measurements of Oceanic Surface Temperature, at times t = 0, t = 12 hours, and t = 24 hours (ATSR Data from the Rutherford Appleton Laboratories)

Middle: Medical Images of the Human Brain, MRI slices 75, 79, and 86 (MRI Data from the Baycrest Centre for Geriatric Care, Toronto)

Bottom: Microscopic Measurements of a Porous Medium, at resolution scales 1, 2, and 3 (Microscopic Data from M. Ioannidis, Dept. Chemical Engineering, University of Waterloo)

Fig. 4.1. There are many circumstances in which we have image sequences. Whether the 
sequence is indexed over time (top), over space (middle), or over resolution (bottom), all of 
these can be modelled as dynamic processes. 



4.1 The Dynamic Problem 



We encountered a basic, canonical dynamic problem in Section 2.5.1. Our starting 
point here is the most famous of all linear systems, the first-order Gauss-Markov 
dynamic model [8, 248] 



$$z(t+1) = A(t) z(t) + B(t) w(t), \qquad w(t) \sim \mathcal{N}(0, I), \qquad (4.1)$$
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in which the random process w is white and uncorrelated with the past of the system: 

$$E\left[w(t) w(s)^T\right] = \delta_{t,s} I, \qquad E\left[w(t) z(s)^T\right] = 0 \ \text{ if } s \le t. \qquad (4.2)$$

The interpretation or purpose of the process noise w can be context dependent: 

• In some cases our dynamic system z(t) is actually stochastic, and really is driven 
by a white noise process, such that w is adding energy into the system. This is 
true for a variety of physical phenomena, the most famous of which is the random 
walk of Brownian motion. 

• Our mathematical model for z(t) may be deterministic, however we may still 
choose to include a noise term w(t). A deterministic estimator will become in- 
creasingly confident over time, to the point where it will refuse to adjust the es- 
timates to changing measurements. A small amount of modelled noise limits the 
estimator confidence and preserves adaptability. 

• The mathematical model A(t), B(t) may, in many cases, be only an approxima- 
tion of the real world. For example, in developing a model for the motion of a car, 
details such as variations in tire pressure, small bumps on the road, wind, and tur- 
bulence may not be modelled. Although more formal methods exist for analyzing 
and dealing with model errors, a noise term in the dynamics gives the estimator a 
bit of flexibility in accommodating model approximation. 

Next, a boundary condition is required to initiate the recursion: 

$$E[z(0)] = z_0, \qquad \operatorname{cov}(z(0)) = P_0. \qquad (4.3)$$

Finally, measurements are available over time: 

$$m(t) = C(t) z(t) + v(t), \qquad v(t) \sim \mathcal{N}(0, R(t)). \qquad (4.4)$$

Taken together, the dynamic equation, boundary condition, and measurements form 
a complete description of the canonical dynamic problem from Section 2.5.1. Let 

$\hat{z}(t|s)$ be the least-squares estimate of $z(t)$ given $m(0), \ldots, m(s)$, and
$\tilde{P}(t|s)$ be the estimation error covariance corresponding to $\hat{z}(t|s)$.

Then, given $m(0), \ldots, m(\tau)$ we have two fundamental problems:

Filtering: find $\hat{z}(t|t)$ for $0 \le t \le \tau$

Smoothing: find $\hat{z}(t|\tau)$ for $0 \le t \le \tau$
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4.1.1 First-Order Gauss-Markov Processes 

The proposed model (4.1) is a first-order, Gauss-Markov, discrete-time random pro- 
cess. At first glance it is unclear to what extent this model has a sufficiently general 
form. For example, two simple, common, alternative models which do not immedi- 
ately fit into the form of (4.1) are an nth-order autoregressive model [37] 



$$z(t+1) = \sum_{i=0}^{n-1} A_i(t) \, z(t-i) + B(t) w(t) \qquad (4.5)$$

or an nth-order correlated-noise (moving-average) model

$$z(t+1) = \sum_{i=0}^{n-1} B_i(t) \, w(t-i). \qquad (4.6)$$



In fact, both of these forms can be rewritten into the form of (4.1), thus any estimator 
compatible with (4.1) does in fact generalize to a much broader range of models. We 
illustrate this generalization for two examples. Given an autoregressive model of the 
form (4.5) 



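$$z(t+1) = \sum_{i=0}^{2} A_i(t) \, z(t-i) + B(t) w(t), \qquad (4.7)$$

then under the following definition of variables

$$\bar{z}(t) = \begin{bmatrix} z(t) \\ z(t-1) \\ z(t-2) \end{bmatrix}, \qquad \bar{A}(t) = \begin{bmatrix} A_0(t) & A_1(t) & A_2(t) \\ I & 0 & 0 \\ 0 & I & 0 \end{bmatrix}, \qquad \bar{B}(t) = \begin{bmatrix} B(t) \\ 0 \\ 0 \end{bmatrix} \qquad (4.8)$$

we obtain a first-order dynamic recursion, having the canonical form of (4.1):

$$\bar{z}(t+1) = \bar{A}(t) \bar{z}(t) + \bar{B}(t) w(t). \qquad (4.9)$$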



A second example illustrates the same idea for the moving-average process of (4.6). 
Given the process 



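$$z(t+1) = \sum_{i=0}^{3} B_i(t) \, w(t-i), \qquad (4.10)$$

we can formulate a change of variables

$$\bar{z}(t) = \begin{bmatrix} z(t) \\ w(t-1) \\ w(t-2) \\ w(t-3) \end{bmatrix}, \qquad \bar{A}(t) = \begin{bmatrix} 0 & B_1(t) & B_2(t) & B_3(t) \\ 0 & 0 & 0 & 0 \\ 0 & I & 0 & 0 \\ 0 & 0 & I & 0 \end{bmatrix}, \qquad \bar{B}(t) = \begin{bmatrix} B_0(t) \\ I \\ 0 \\ 0 \end{bmatrix} \qquad (4.11)$$

such that the following Gauss-Markov model implements the moving average of
(4.10):

$$\bar{z}(t+1) = \bar{A}(t) \bar{z}(t) + \bar{B}(t) w(t). \qquad (4.12)$$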

Thus we claim that, indeed, the proposed first-order Gauss-Markov model (4.1) can 
encompass a wide variety of models, both first- and higher-order. 
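The state-augmentation idea can be sketched for a scalar third-order autoregressive process. The coefficients below are arbitrary, hypothetical values, and the process is assumed to be zero before time zero. The same noise sequence drives both the direct recursion and the augmented first-order recursion, so the two outputs should agree exactly.

```python
import numpy as np

a = np.array([0.5, -0.3, 0.1])     # hypothetical scalar coefficients A_0, A_1, A_2
b = 1.0                            # hypothetical scalar B

# Augmented first-order model: state is [z(t), z(t-1), z(t-2)]
A_bar = np.array([[a[0], a[1], a[2]],
                  [1.0,  0.0,  0.0],
                  [0.0,  1.0,  0.0]])
B_bar = np.array([b, 0.0, 0.0])

rng = np.random.default_rng(5)
T = 200
w = rng.standard_normal(T)         # one shared white-noise sequence

# Direct third-order recursion, with z(t) = 0 for t <= 0
z_direct = np.zeros(T + 1)
for t in range(T):
    past = np.array([z_direct[t - i] if t - i >= 0 else 0.0 for i in range(3)])
    z_direct[t + 1] = a @ past + b * w[t]

# Equivalent first-order recursion on the augmented state
z_bar = np.zeros(3)
z_augmented = np.zeros(T + 1)
for t in range(T):
    z_bar = A_bar @ z_bar + B_bar * w[t]
    z_augmented[t + 1] = z_bar[0]  # first component is the original process

print(np.max(np.abs(z_direct - z_augmented)))
```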

Next, it should be clarified that the entire prior model over all time $t \ge 0$ is specified,
implicitly, by the boundary condition at time $t = 0$. Given

$$E[z(0)] = z_0, \qquad \operatorname{cov}(z(0)) = P_0, \qquad (4.13)$$

we can find a recursive form for the mean $E[z(t)]$,

$$\begin{aligned}
E[z(t+1)] &= E[A(t) z(t) + B(t) w(t)] && (4.14) \\
&= A(t) E[z(t)] && (4.15) \\
&= A(t) A(t-1) \cdots A(0) z_0 && (4.16)
\end{aligned}$$

and similarly for the covariance $P(t)$,

$$\begin{aligned}
P(t+1) &= \operatorname{cov}(A(t) z(t) + B(t) w(t)) && (4.17) \\
&= A(t) P(t) A^T(t) + B(t) B^T(t). && (4.18)
\end{aligned}$$

If the system dynamics are time-invariant, A(t) = A, and stable, meaning that all of 
the eigenvalues of A have a magnitude less than one, then the process converges to 
a statistical steady-state. That is, z(t) continues to change and evolve, but all of the 
statistics of z(t) converge. In particular, the mean converges to zero, 

$$\lim_{t \to \infty} E[z(t)] = \lim_{t \to \infty} A^t z_0 = 0, \qquad (4.19)$$

and the process covariance converges to the solution of the Algebraic Lyapunov
Equation:

$$\lim_{t \to \infty} P(t) = P_\infty \quad \text{where} \quad P_\infty = A P_\infty A^T + B B^T. \qquad (4.20)$$

The steady-state case is of some interest in Kalman filtering and will be revisited in
Section 4.3.2.
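As a small check of the steady-state claim, the sketch below iterates the covariance recursion for an arbitrary stable, time-invariant pair $A$, $B$ (hypothetical values) and compares the result with a direct solution of the Lyapunov equation, obtained here by vectorizing the equation with a Kronecker product. This is only one simple way of solving it, suitable for small systems.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.7]])           # stable: eigenvalues 0.9 and 0.7
B = np.array([[1.0, 0.0],
              [0.5, 1.0]])

# Iterate P(t+1) = A P(t) A^T + B B^T starting from P(0) = 0
P = np.zeros((2, 2))
for _ in range(500):
    P = A @ P @ A.T + B @ B.T

# Direct solution: vec(P) = (I - A kron A)^{-1} vec(B B^T)
Q = B @ B.T
P_inf = np.linalg.solve(np.eye(4) - np.kron(A, A), Q.reshape(-1)).reshape(2, 2)

print(np.max(np.abs(P - P_inf)))     # the iteration has converged to the Lyapunov solution
```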



4.1.2 Static — Dynamic Duality 

There is nothing inherently difficult in estimating a dynamic quantity. In fact, it is 
possible to convert the dynamic problem into a static one, and then to use the methods 
of Chapter 3 which we have derived for static estimation. 

Consider the problem of estimating z(t\t). Suppose we create augmented state and 
measurement vectors 
$$\mathbf{z}(t) = \begin{bmatrix} z(0) \\ \vdots \\ z(t) \end{bmatrix}, \qquad \mathbf{m}(t) = \begin{bmatrix} m(0) \\ \vdots \\ m(t) \end{bmatrix}. \qquad (4.21)$$



Then solving the estimation problem produces estimates 

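$$\hat{\mathbf{z}} = \begin{bmatrix} \hat{z}(0|t) \\ \vdots \\ \hat{z}(t|t) \end{bmatrix} \qquad (4.22)$$

of each element of $\mathbf{z}$ given all of the measurements in $\mathbf{m}$, where the estimates include
the desired $\hat{z}(t|t)$.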

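If we concatenate the dynamic measurement model as

$$\mathbf{m}(t) = \mathbf{C}(t) \mathbf{z}(t) + \mathbf{v}(t), \qquad \mathbf{v}(t) = \begin{bmatrix} v(0) \\ \vdots \\ v(t) \end{bmatrix}, \qquad \mathbf{C}(t) = \begin{bmatrix} C(0) & & \\ & \ddots & \\ & & C(t) \end{bmatrix}, \qquad \mathbf{R}(t) = \begin{bmatrix} R(0) & & \\ & \ddots & \\ & & R(t) \end{bmatrix} \qquad (4.23)$$

where $\mathbf{C}(t), \mathbf{R}(t)$ are block-diagonal, then we have recovered the measurement
model corresponding to the following static problem:

$$\mathbf{m} = \mathbf{C} \mathbf{z} + \mathbf{v}, \qquad \mathbf{z} \sim (\boldsymbol{\mu}, \mathbf{P}), \qquad \mathbf{v} \sim (0, \mathbf{R}). \qquad (4.24)$$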



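What remains is the more difficult rearrangement of the dynamic model for $z(t)$ to
infer the prior covariance $\mathbf{P}$ of the above static problem. We begin with the
straightforward computation of the prior mean; from (4.16) it follows that

$$\boldsymbol{\mu}(t) = E[\mathbf{z}(t)] = E \begin{bmatrix} z(0) \\ z(1) \\ \vdots \\ z(t) \end{bmatrix} = \begin{bmatrix} z_0 \\ A(0) z_0 \\ \vdots \\ A(t-1) \cdots A(0) z_0 \end{bmatrix}. \qquad (4.25)$$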



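Next, to compute the prior covariance $\mathbf{P}$, observe that

$$\begin{aligned}
E&\left[ (z(t+1) - \mu(t+1)) (z(t) - \mu(t))^T \right] \\
&= E\left[ (A(t) z(t) + B(t) w(t) - A(t) \mu(t)) (z(t) - \mu(t))^T \right] \\
&= A(t) E\left[ (z(t) - \mu(t)) (z(t) - \mu(t))^T \right] + B(t) E\left[ w(t) (z(t) - \mu(t))^T \right] \\
&= A(t) P(t) + 0. \qquad (4.26)
\end{aligned}$$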



Iterating (4.26), we can find all needed cross-correlations

$$\begin{aligned}
E\left[ (z(t+s) - \mu(t+s)) (z(t) - \mu(t))^T \right] &= A(t+s-1) \cdots A(t) P(t) \\
E\left[ (z(t) - \mu(t)) (z(t+s) - \mu(t+s))^T \right] &= P(t) A^T(t) \cdots A^T(t+s-1)
\end{aligned} \qquad (4.27)$$

for $s > 0$, from which follows the massive covariance $\mathbf{P}(t) = \operatorname{cov}(\mathbf{z}(t))$:

$$\mathbf{P}(t) = \begin{bmatrix}
P(0) & P(0) A^T(0) & P(0) A^T(0) A^T(1) & \cdots \\
A(0) P(0) & P(1) & P(1) A^T(1) & \cdots \\
A(1) A(0) P(0) & A(1) P(1) & P(2) & \cdots \\
\vdots & \vdots & \vdots & \ddots \\
 & & & P(t)
\end{bmatrix} \qquad (4.28)$$



With the definition of the static problem complete, we can use the results of Sec- 
tion 3.2.1 to find the estimates: 



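$$\hat{\mathbf{z}} = \begin{bmatrix} \hat{z}(0|t) \\ \vdots \\ \hat{z}(t|t) \end{bmatrix} = \boldsymbol{\mu} + \left( \mathbf{C}^T \mathbf{R}^{-1} \mathbf{C} + \mathbf{P}^{-1} \right)^{-1} \mathbf{C}^T \mathbf{R}^{-1} (\mathbf{m} - \mathbf{C} \boldsymbol{\mu}), \qquad (4.29)$$

from which the desired estimate, $\hat{z}(t|t)$, can be extracted as the last element of $\hat{\mathbf{z}}$.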

The problem with this dual approach, which becomes obvious when looking at 
(4.28), is that the size of the vectors and matrices grows with increasing t. That 
is, the problem becomes increasingly difficult over time. For example, if the system 
state is n-dimensional, $z(t) \in \mathbb{R}^n$, then the computation of $\hat{z}(t|t)$ involves vectors of
length $(t+1)n$ and a matrix-inversion complexity of $O((tn)^3)$.
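The static formulation can be sketched for a scalar, time-invariant process. The values of `a`, `b`, `P0`, the horizon `T`, and the two measurements are assumptions chosen for illustration, and the prior mean is taken to be zero. The stacked prior covariance is assembled from the marginal variances and cross-covariances, and the estimates over the whole horizon then follow from the static estimator.

```python
import numpy as np

a, b, P0, T = 0.95, 0.3, 1.0, 30     # hypothetical scalar dynamics z(t+1) = a z(t) + b w(t)

# Marginal variances P(t) from the covariance recursion P(t+1) = a P(t) a + b^2
Pt = np.zeros(T + 1)
Pt[0] = P0
for t in range(T):
    Pt[t + 1] = a * Pt[t] * a + b * b

# Stacked prior covariance: cov(z(t+s), z(t)) = a^s P(t) for s >= 0
P_big = np.zeros((T + 1, T + 1))
for t in range(T + 1):
    for s in range(T + 1 - t):
        P_big[t + s, t] = a ** s * Pt[t]
        P_big[t, t + s] = Pt[t] * a ** s

# Two scalar measurements, of z(10) and z(20), with noise variance r
obs = [10, 20]
r = 0.01
C = np.zeros((len(obs), T + 1))
C[np.arange(len(obs)), obs] = 1.0
R = r * np.eye(len(obs))
m = np.array([1.0, -0.5])

# Static estimator over the whole stacked state (zero prior mean)
Pinv, Rinv = np.linalg.inv(P_big), np.linalg.inv(R)
z_hat = np.linalg.solve(C.T @ Rinv @ C + Pinv, C.T @ Rinv @ m)

print("stacked state length:", z_hat.size)
print("estimate at t = 10  :", z_hat[10])
```

Even in this scalar case the matrices are of size $(T+1) \times (T+1)$ and grow with every additional time step, which is the inefficiency that the recursive estimator is meant to avoid.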

There is also something dissatisfying about this dual approach, in that there is a great 
deal of duplicated effort in computing $\hat{z}(t|t)$ and $\hat{z}(t+1|t+1)$. In particular, observe
that the problem formulation at time t + 1 shares a great deal in common with that 
at time t: 



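$$\mathbf{C}(t+1) = \begin{bmatrix} \mathbf{C}(t) & \\ & C(t+1) \end{bmatrix}, \qquad \mathbf{R}(t+1) = \begin{bmatrix} \mathbf{R}(t) & \\ & R(t+1) \end{bmatrix} \qquad (4.30)$$

$$\mathbf{P}(t+1) = \begin{bmatrix} \mathbf{P}(t) & D \\ D^T & P(t+1) \end{bmatrix} \qquad (4.31)$$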



The key to finding a more efficient estimator is to look again at the original recursive
dynamics in (4.1). In particular, note that the state of the system at time $t + 1$ depends
only on the state at time $t$; that is, the state $z(t)$ somehow captures or summarizes
the entire past history of the process¹ that is relevant to $z(t + 1)$. The derivation in
Section 4.2 shows that such a recursive form also exists for the estimator $\hat{z}(t)$.



1 Which is essentially a statement of Markovianity, a subject which is discussed further in 
Chapter 6. 
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Example 4.1: Dynamic Estimation and Sampling 

Following up on Examples 2.9 and 3.4, suppose we have two random processes, 
where B in (4.1) controls the amount of process noise: 



[Figure: estimates $\hat{z}(t|t)$ for small $B$ (left) and large $B$ (right).]




Both processes are scalar, first-order, with $A \approx 0.99$. The solid line plots the estimates,
and the dashed lines the one standard-deviation envelope. Clearly the rate
at which the uncertainty grows is a function of B. Now suppose a measurement 
is introduced: 



z(t\t) 




Because the estimates are causal, z(t\t) is unaware of any measurement until the 
measurement occurs. Each measurement constrains the posterior samples of the 
random process, with multiple measurements introducing multiple constraints: 



z(t\t) 




^ Jk 



Solving for these dynamic estimates is not hard, and follows from (4.29). What 
is unique about the Kalman filter, derived in Section 4.2, is an efficient, recursive 
approach to producing dynamic estimates such as these. 
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4.2 Kalman Filter Derivation 



We seek to discover a recursive form for the estimator z(t\t). It is common to en- 
counter Kalman filter derivations in textbooks which assert the recursive form, 

z(t + l\t + 1) = H(i)z(t\i) + K{t)m{t) + a(t), (4.32) 

and then derive the optimum settings for H (t) , K (t) , a(t) . However, such an ap- 
proach does two disservices to the reader: 

1. By asserting the recursive form ab initio, one finds only the best recursive form, 
without necessarily showing that the recursive form is, in fact, the best overall 
estimator. 

2. By asserting the recursive form, the opportunity is lost to illustrate to the reader 
why a recursive form is optimal. That is, what is it in the structure of the problem 
which allows a recursive form to emerge? 

The remainder of this section, which borrows from the ideas and structure of [333], 
is divided into three parts: 

1. Indirect estimation,

2. Uncorrelated measurements, and 

3. Recursive estimation. 

Although there are insights to be gained by understanding the derivation in detail, 
the reader may choose to skip the following on first reading. 



I. Indirect Estimation 

Suppose we are given the following problem:

$$\begin{aligned} m &= Cz + v, & v &\sim \mathcal{N}(0,R), & z &\sim \mathcal{N}(0,P) \\ q &= Dz + w, & w &\sim \mathcal{N}(0,S) \end{aligned} \qquad (4.33)$$

That is, $m$ is a known measurement of $z$ and $q$ is unknown, where the noise processes
$v$, $w$ are uncorrelated with each other and with $z$. We know the forms of the static
estimators:

$$\hat{z}(m) = \Lambda_{zm}\Lambda_m^{-1} m \qquad \hat{q}(m) = \Lambda_{qm}\Lambda_m^{-1} m. \qquad (4.34)$$

The required cross-statistics are easily derived:

$$\Lambda_{zm} = E[z m^T] = E[z(Cz+v)^T] = PC^T \qquad (4.35)$$
$$\Lambda_{qm} = E[q m^T] = E[(Dz+w)(Cz+v)^T] = DPC^T \qquad (4.36)$$
$$\Lambda_{q} = E[q q^T] = E[(Dz+w)(Dz+w)^T] = DPD^T + S \qquad (4.37)$$

Thus the indirect estimator may be expressed as

$$\hat{q}(m) = DPC^T\Lambda_m^{-1} m = D\hat{z}(m). \qquad (4.38)$$

Similarly, the indirect estimation error covariance obeys

$$\tilde{P}_q = \Lambda_q - \Lambda_{qm}\Lambda_m^{-1}\Lambda_{qm}^T \qquad (4.39)$$
$$\phantom{\tilde{P}_q} = DPD^T + S - DPC^T\Lambda_m^{-1}CPD^T \qquad (4.40)$$
$$\phantom{\tilde{P}_q} = D(P - PC^T\Lambda_m^{-1}CP)D^T + S = D\tilde{P}_z D^T + S, \qquad (4.41)$$

where $\tilde{P}_z = P - PC^T\Lambda_m^{-1}CP$ is the estimation error covariance of $\hat{z}(m)$.

Thus we have the elegant conclusion that if we have an estimator $\hat{z}$ for $z$, then the
estimator for any linear function of $z$ is just that linear function of $\hat{z}$:

$$q = Dz + w, \quad w \sim (0,S) \quad \Longrightarrow \quad \hat{q}(m) = D\hat{z}(m), \quad \tilde{P}_q = D\tilde{P}_z D^T + S. \qquad (4.42)$$

This relationship will be required in order to express estimates at time $t+1$ in terms
of estimates at time $t$.
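The indirect-estimation result (4.42) is easy to confirm numerically. The following is a minimal sketch, assuming NumPy and arbitrary randomly generated model matrices (the sizes, the seed, and the helper `random_spd` are illustrative choices, not part of the text); it also assumes the measurement covariance $CPC^T + R$ implied by $m = Cz + v$. It builds the two static estimators of (4.34) separately and compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 4, 3, 2          # sizes of z, m and q (arbitrary illustrative choices)

def random_spd(d):
    """Random symmetric positive-definite matrix (hypothetical helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

P, R, S = random_spd(n), random_spd(k), random_spd(p)
C = rng.standard_normal((k, n))
D = rng.standard_normal((p, n))

# Measurement covariance and the cross-statistics (4.35)-(4.37)
L_m = C @ P @ C.T + R
L_zm = P @ C.T
L_qm = D @ P @ C.T
L_q = D @ P @ D.T + S

# Gains of the two static estimators in (4.34)
G_z = L_zm @ np.linalg.inv(L_m)
G_q = L_qm @ np.linalg.inv(L_m)

# Estimation error covariances of z and of q
P_z = P - G_z @ L_zm.T
P_q = L_q - G_q @ L_qm.T

# Apply both estimators to one arbitrary measurement vector
m = rng.standard_normal(k)
q_direct = G_q @ m            # estimate q directly from m
q_indirect = D @ (G_z @ m)    # estimate z first, then apply D
```

Under these assumptions `q_direct` and `q_indirect` agree to rounding error, and `P_q` matches `D @ P_z @ D.T + S`.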



II. Uncorrelated Measurements 



Next, we wish to produce estimates based on two groups of measurements. 

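Suppose we wish to estimate zero-mean $z$ based on measurements divided into two
uncorrelated groups $m_a$, $m_b$. Note that we are making the rather unusual insistence
that the measurements be uncorrelated, not just the measurement errors. That is, the
statistics of the measurements obey

$$\Lambda_m = \mathrm{cov}(m) = \begin{bmatrix} \Lambda_a & 0 \\ 0 & \Lambda_b \end{bmatrix} \quad \text{where} \quad m = \begin{bmatrix} m_a \\ m_b \end{bmatrix}. \qquad (4.43)$$

Next, define the cross-statistics between the measurements and unknowns as

$$\Lambda_{zm} = E[z m^T] = E\left[z \begin{bmatrix} m_a^T & m_b^T \end{bmatrix}\right] = \begin{bmatrix} \Lambda_{za} & \Lambda_{zb} \end{bmatrix}. \qquad (4.44)$$

Then, using the standard LLSE form, we can derive the estimator for $z$:

$$\hat{z}(m) = \Lambda_{zm}\Lambda_m^{-1} m = \begin{bmatrix} \Lambda_{za} & \Lambda_{zb} \end{bmatrix} \begin{bmatrix} \Lambda_a & 0 \\ 0 & \Lambda_b \end{bmatrix}^{-1} \begin{bmatrix} m_a \\ m_b \end{bmatrix} \qquad (4.45)$$
$$\phantom{\hat{z}(m)} = \begin{bmatrix} \Lambda_{za} & \Lambda_{zb} \end{bmatrix} \begin{bmatrix} \Lambda_a^{-1} & 0 \\ 0 & \Lambda_b^{-1} \end{bmatrix} \begin{bmatrix} m_a \\ m_b \end{bmatrix} \qquad (4.46)$$
$$\phantom{\hat{z}(m)} = \Lambda_{za}\Lambda_a^{-1} m_a + \Lambda_{zb}\Lambda_b^{-1} m_b \qquad (4.47)$$
$$\phantom{\hat{z}(m)} = \hat{z}(m_a) + \hat{z}(m_b). \qquad (4.48)$$

That is, if two measurement sets are uncorrelated, then the optimum estimator is just
the sum of the estimates produced by each measurement set.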
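The additivity in (4.48) rests only on the block-diagonal structure of the measurement covariance. A small sketch of that algebra, assuming NumPy and arbitrary illustrative statistics (the matrices below are random placeholders rather than the statistics of any particular model, and `random_spd` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, ka, kb = 3, 2, 2        # sizes of z, m_a, m_b (arbitrary)

def random_spd(d):
    """Random symmetric positive-definite matrix (hypothetical helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

L_a, L_b = random_spd(ka), random_spd(kb)
L_za = rng.standard_normal((n, ka))
L_zb = rng.standard_normal((n, kb))

# Joint statistics with zero cross-covariance between the groups, as in (4.43)
L_m = np.block([[L_a, np.zeros((ka, kb))],
                [np.zeros((kb, ka)), L_b]])
L_zm = np.hstack([L_za, L_zb])

m_a, m_b = rng.standard_normal(ka), rng.standard_normal(kb)
m = np.concatenate([m_a, m_b])

# Joint estimator (4.45) versus the sum of the two separate estimators (4.48)
z_joint = L_zm @ np.linalg.solve(L_m, m)
z_sum = L_za @ np.linalg.solve(L_a, m_a) + L_zb @ np.linalg.solve(L_b, m_b)
```

If the off-diagonal blocks of `L_m` were nonzero, the two results would in general differ, which is why the measurements must be decorrelated before a recursion of this type can be used.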



This result is important, in that we set $m_a$ to be the entire past history of
measurements, and $m_b$ to be the current measurements at time $t$, which will thus lead to a
recursive estimator

$$\hat{z}(t|t) = \hat{z}(\text{all past measurements}) + \hat{z}(\text{measurements at time } t). \qquad (4.49)$$

The measurements are normally not uncorrelated over time, therefore the key is to
determine how to decorrelate them.

III. Recursive Estimation 

Because (4.1) has the form of (4.33), it follows that

$$\hat{z}(t|t-1) = A(t-1)\hat{z}(t-1|t-1) \qquad (4.50)$$
$$P(t|t-1) = A(t-1)P(t-1|t-1)A^T(t-1) + B(t-1)B^T(t-1). \qquad (4.51)$$

This is known as the prediction step: the prediction over time of the estimates and
related statistics.

In addition to predicting estimates over time, we also need to incorporate new
measurements as they arrive, known as the update step. The key is the creation of an
innovations process $\nu(t)$:

$$\nu(t) = m(t) - \hat{m}(t|t-1). \qquad (4.52)$$

Although it may seem odd to construct estimates of measurements, $m(t)$ is a random
vector which obeys certain statistics and can be estimated, just like any other. From
its definition (4.52), $\nu(t)$ contains the information present in $m(t)$ which cannot be
inferred from past measurements $m(0), \ldots, m(t-1)$. That is, $\hat{m}(t|t-1)$ is the
predictable part of $m(t)$, and $\nu(t)$ contains only the new information in $m(t)$.

Because $\nu(t)$ is the estimation error in $\hat{m}(t|t-1)$, by the orthogonality principle

$$E[\nu(t)\, m^T(s)] = 0, \quad 0 \le s < t. \qquad (4.53)$$

That is, we have two decorrelated measurement sets: $\nu(t)$ and $m(0), \ldots, m(t-1)$.

Next, because (4.4) also has the form of (4.33), again it follows that

$$\hat{m}(t|t-1) = C(t)\hat{z}(t|t-1) \qquad (4.54)$$

thus the innovations can be written as

$$\nu(t) = m(t) - \hat{m}(t|t-1) = C(t)\,[z(t) - \hat{z}(t|t-1)] + v(t) \qquad (4.55)$$
$$P_\nu(t) = C(t)P(t|t-1)C^T(t) + R(t). \qquad (4.56)$$

Finally, let $e(t)$ represent the predicted estimation error

$$z(t) = \hat{z}(t|t-1) + e(t). \qquad (4.57)$$

By orthogonality, $e(t)$ must also be uncorrelated with $m(s)$, $0 \le s < t$. Therefore

$$\hat{z}(t|t) = \hat{z}(t|t-1) + \hat{e}(t|t) \qquad (4.58)$$
$$\phantom{\hat{z}(t|t)} = \hat{z}(t|t-1) + \hat{e}(t\,|\,m(0), \ldots, m(t-1), \nu(t)) \qquad (4.59)$$
$$\phantom{\hat{z}(t|t)} = \hat{z}(t|t-1) + \hat{e}(t\,|\,m(0), \ldots, m(t-1)) + \hat{e}(t\,|\,\nu(t)) \qquad (4.60)$$
$$\phantom{\hat{z}(t|t)} = \hat{z}(t|t-1) + 0 + \hat{e}(t\,|\,\nu(t)) \qquad (4.61)$$
$$\phantom{\hat{z}(t|t)} = \hat{z}(t|t-1) + P_{e\nu}P_\nu^{-1}\nu(t), \qquad (4.62)$$

where (4.58) follows from I, (4.60) follows from II, (4.61) from orthogonality, and
(4.62) from static estimation.

All that remains to complete the derivation is to find the innovation statistics:

$$P_\nu(t) = C(t)P(t|t-1)C^T(t) + R(t) \qquad (4.63)$$
$$P_{e\nu}(t) = E\left[e(t)\,(m(t) - \hat{m}(t|t-1))^T\right] \qquad (4.64)$$
$$\phantom{P_{e\nu}(t)} = E\left[e(t)\,(C(t)e(t) + v(t))^T\right] \qquad (4.65)$$
$$\phantom{P_{e\nu}(t)} = P(t|t-1)C^T(t), \qquad (4.66)$$

where the last relationship follows because the prediction error $e(t)$ is uncorrelated
with the measurement error $v(t)$.

Thus we have completed the derivation of the update step:

$$\hat{z}(t|t) = \text{Predicted Estimate} + \text{Measurement Relevance} \cdot \text{New Information}$$
$$\phantom{\hat{z}(t|t)} = \hat{z}(t|t-1) + K(t) \cdot (m(t) - C(t)\hat{z}(t|t-1)) \qquad (4.67)$$

where, from (4.63) and (4.66),

$$K(t) = P_{e\nu}P_\nu^{-1} = P(t|t-1)C^T(t)\left(C(t)P(t|t-1)C^T(t) + R(t)\right)^{-1}. \qquad (4.68)$$

The corresponding estimation error covariance is

$$P(t|t) = \text{Predicted Uncertainty} - \text{Uncertainty Reduction due to Measurements}$$
$$\phantom{P(t|t)} = P(t|t-1) - P_{e\nu}P_\nu^{-1}P_{e\nu}^T$$
$$\phantom{P(t|t)} = P(t|t-1) - K(t)C(t)P(t|t-1). \qquad (4.69)$$

The resulting discrete-time Kalman filter is summarized in Algorithm 1.

Algorithm 1 The Kalman Filter

Goals: Iteratively find estimates $\hat{z}$ over $1 \le t \le \tau$, given initialization $(z_0, P_0)$
Function $[\hat{z}(1|1), P(1|1), \ldots, \hat{z}(\tau|\tau), P(\tau|\tau)] = \mathrm{KF}(A, B, m, C, R, P_0, z_0, \tau)$

    $\hat{z}(0|0) \leftarrow z_0$                                        Initialize at time $t = 0$
    $P(0|0) \leftarrow P_0$
    for $i \leftarrow 1 : \tau$ do
        $\hat{z}(i|i-1) \leftarrow A\hat{z}(i-1|i-1)$                          State Prediction
        $P(i|i-1) \leftarrow AP(i-1|i-1)A^T + BB^T$                  Covariance Prediction
        $K \leftarrow P(i|i-1)C^T\,\mathrm{inv}(CP(i|i-1)C^T + R)$               Gain
        $\hat{z}(i|i) \leftarrow \hat{z}(i|i-1) + K(m(i) - C\hat{z}(i|i-1))$          State Update
        $P(i|i) \leftarrow P(i|i-1) - KCP(i|i-1)$                    Covariance Update
    end for
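Algorithm 1 translates almost line for line into code. The following is a minimal sketch in Python, assuming NumPy and a time-invariant model; the function name `kalman_filter`, its argument order, and the small scalar test case are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def kalman_filter(A, B, C, R, z0, P0, measurements):
    """Time-invariant Kalman filter following Algorithm 1.

    measurements: sequence of vectors m(1), ..., m(tau).
    Returns lists of updated estimates z(i|i) and covariances P(i|i).
    """
    z = np.asarray(z0, dtype=float)
    P = np.asarray(P0, dtype=float)
    estimates, covariances = [], []
    for m in measurements:
        # Prediction step
        z_pred = A @ z
        P_pred = A @ P @ A.T + B @ B.T
        # Gain K = P_pred C^T S^{-1}; solve() avoids forming the inverse
        # explicitly and relies on S and P_pred being symmetric
        S = C @ P_pred @ C.T + R
        K = np.linalg.solve(S, C @ P_pred).T
        # Update step
        z = z_pred + K @ (np.asarray(m, dtype=float) - C @ z_pred)
        P = P_pred - K @ C @ P_pred
        estimates.append(z)
        covariances.append(P)
    return estimates, covariances

# Scalar random walk with unit process and measurement noise (illustrative)
A = np.array([[1.0]])
B = np.array([[1.0]])
C = np.array([[1.0]])
R = np.array([[1.0]])
estimates, covariances = kalman_filter(A, B, C, R, z0=[0.0], P0=[[1.0]],
                                       measurements=[[1.0], [2.0]])
```

For this scalar case the recursion can be followed by hand: the first step gives a predicted variance of 2, a gain of 2/3 and an updated variance of 2/3; the second gives a predicted variance of 5/3, a gain of 5/8, an estimate of 1.5 and an updated variance of 5/8.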



IV. Discussion 

The preceding derivation shows that the optimum linear least-squares estimator for a 
dynamic model of the form (4.1), (4.4) can be written in an efficient, recursive form, 
known as the Kalman filter. The three key steps are 

I. Indirect estimation — how to generate estimates of linear functions of z, allowing 
estimates to be projected or predicted over time; 

II. Multiple measurements — how to generate estimates based on multiple groups 
of decorrelated measurements; 

III. Measurement decorrelation — how to actually decorrelate the measurements in 
order to separate "past" and "current", the key to the recursion. 

The structure of the resulting equations can be seen much more clearly in the
time-invariant case:

$$A(t) = A \qquad B(t) = B \qquad C(t) = C \qquad R(t) = R, \qquad (4.72)$$

in which the Kalman filter may be written as

Prediction:
$$\hat{z}(t|t-1) = A\hat{z}(t-1|t-1) \qquad (4.73)$$
$$P(t|t-1) = AP(t-1|t-1)A^T + BB^T \qquad (4.74)$$

Update:
$$\hat{z}(t|t) = \hat{z}(t|t-1) + K(t)\,(m(t) - C\hat{z}(t|t-1)) \qquad (4.75)$$
$$P(t|t) = P(t|t-1) - K(t) \cdot C \cdot P(t|t-1) \qquad (4.76)$$
$$K(t) = P(t|t-1) \cdot C^T \cdot (CP(t|t-1)C^T + R)^{-1} \qquad (4.77)$$

As should be expected, the dynamics $A$, $B$ affect only the prediction step; the update
step is essentially a static estimator, taking place at a fixed point in time, and knowing
nothing about dynamics. Similarly the measurements $m$ and related model parameters
$C$, $R$ appear only in the update step. Note that if there are no measurements at
some time $t$, then the update step is skipped completely:

$$\hat{z}(t|t) = \hat{z}(t|t-1) \qquad P(t|t) = P(t|t-1) \qquad (4.78)$$

Example 4.2: Kalman Filtering and Interpolation

Suppose that we have a dynamic interpolation problem

$$z(t) = \alpha z(t-1) + \beta w(t) \qquad m(t) = z(t) + v(t) \qquad \text{where } w \sim \mathcal{N}(0,1),\ v \sim \mathcal{N}(0,\sigma^2) \qquad (4.70)$$

Then the Kalman filter follows easily from (4.73)-(4.77):

$$\begin{aligned} \text{Predict:} \quad & \hat{z}(t|t-1) = \alpha\hat{z}(t-1|t-1) \\ & P(t|t-1) = \alpha^2 P(t-1|t-1) + \beta^2 \\ \text{Update:} \quad & K(t) = P(t|t-1)/(P(t|t-1) + \sigma^2) \\ & \hat{z}(t|t) = \hat{z}(t|t-1) + K(t)\,(m(t) - \hat{z}(t|t-1)) \\ & P(t|t) = P(t|t-1)(1 - K(t)) \end{aligned} \qquad (4.71)$$

where it is important to observe that parameters $\alpha$, $\beta$ from the dynamic model
appear only in the predict step, whereas the measurement parameter $\sigma$ appears
only in the update.

Suppose we have three measurements, plotted as circles, at times $t_1$, $t_2$, $t_3$. Let's
consider the effects of varying $\alpha$, $\beta$, and $\sigma$.

If $\alpha < 1$ then, over time, $\hat{z} \to 0$ and the rate of increase in $P$ is mostly controlled
by $\beta$. At each discrete measurement, $\hat{z}$ is updated as $P$ is reduced:

[Figure: $\hat{z}(t|t)$ and $P(t|t)$ plotted over time for different parameter settings; the panel labels are not recoverable from the extracted text.]

The more rapidly $P$ grows over time, the more rapidly a past measurement is
"forgotten." At any point in time where a measurement appears, the influence of
the measurement on the estimate depends on the relative values of $P(t|t-1)$ and
$\sigma^2$.

Clear patterns emerge:

    Phenomenon                      Controlled By
    Rate of decay of $\hat{z}$            $\alpha$
    Converged value of $P$            $\alpha$, $\beta$
    Rate of growth of $P$             Mostly $\beta$
    Closeness of $\hat{z}$ to $m$         $\sigma^2$ and $P(t|t-1)$
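The scalar filter of Example 4.2, together with the skipped update of (4.78), can be sketched in a few lines of plain Python. The function name `scalar_kalman`, the initial values `z0` and `p0`, the use of `None` to mark a time step without a measurement, and the parameter values in the usage lines are all assumptions made for illustration.

```python
def scalar_kalman(alpha, beta, sigma, measurements, z0=0.0, p0=1.0):
    """Scalar Kalman filter for z(t) = alpha z(t-1) + beta w(t), m(t) = z(t) + v(t).

    measurements: one entry per time step; None marks a step with no
    measurement, in which case the update is skipped as in (4.78).
    Returns the lists of z(t|t) and P(t|t).
    """
    z, p = z0, p0
    z_hist, p_hist = [], []
    for m in measurements:
        # Predict: only the dynamic parameters alpha and beta appear
        z = alpha * z
        p = alpha**2 * p + beta**2
        if m is not None:
            # Update: only the measurement parameter sigma appears
            k = p / (p + sigma**2)
            z = z + k * (m - z)
            p = p * (1.0 - k)
        z_hist.append(z)
        p_hist.append(p)
    return z_hist, p_hist

# A single measurement of value 1.0 at the fourth time step
ms = [None, None, None, 1.0, None, None, None]
z_hist, p_hist = scalar_kalman(alpha=0.9, beta=0.3, sigma=0.5, measurements=ms)
```

In this run the estimate stays at zero until the measurement arrives, the variance drops at the measurement, and afterwards the estimate decays by the factor `alpha` per step while the variance grows again, which is the qualitative behaviour summarized in the table of Example 4.2.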



Understanding the connection to static estimation, we see that the update step is
essentially written as Form I (3.65). Through the ABCD lemma in Appendix A.2,
Form II (3.67) gives us an alternate version of the Kalman update

$$K(t) = P(t|t)C^T(t)R^{-1}(t) \qquad (4.79)$$
$$P(t|t) = \left(P(t|t-1)^{-1} + C^T(t)R^{-1}(t)C(t)\right)^{-1}. \qquad (4.80)$$

Finally, observe that the uncertainty $P$ can never increase in the update step; that
is,² $P(t|t) \le P(t|t-1)$. The measurements can only improve our understanding;
at worst the measurements are irrelevant, in which case the gain $K(t) = 0$ and the
uncertainty $P(t|t) = P(t|t-1)$ is unchanged. On the other hand, the time-dynamics
normally include an unpredictable stochastic term $w(t)$, so the uncertainty $P$ does
increase by an amount $BB^T$ in the prediction step.

Because the Kalman filter involves matrix-matrix multiplication and matrix inversion,
the computational complexity of the filter is $O(n^3)$ per iteration, where $n$ is the
dimensionality of $z$. In cases where $z$ represents some dynamic system with a few
degrees of freedom, the Kalman filter is easily implemented in real-time. However, if
$z$ represents an image sequence, such as in video processing, where each image
contains one million pixels, then $O(n^3)$ may be prohibitive and other approaches need
to be considered, which are the focus of the large-scale Kalman filtering methods in
Chapter 10.
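Two claims of this discussion lend themselves to a quick numerical check: that the update of (4.76), (4.77) and the alternate update of (4.79), (4.80) coincide, and that the difference $P(t|t-1) - P(t|t)$ is positive-semidefinite. The following sketch assumes NumPy and arbitrary random test matrices; the sizes, the seed, and the helper `random_spd` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 2                # state and measurement sizes (arbitrary)

def random_spd(d):
    """Random symmetric positive-definite matrix (hypothetical helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

P_pred, R = random_spd(n), random_spd(k)
C = rng.standard_normal((k, n))

# Update in the form of (4.77) and (4.76)
K1 = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
P1 = P_pred - K1 @ C @ P_pred

# Alternate update in the form of (4.80) and (4.79)
P2 = np.linalg.inv(np.linalg.inv(P_pred) + C.T @ np.linalg.inv(R) @ C)
K2 = P2 @ C.T @ np.linalg.inv(R)

# Uncertainty reduction P(t|t-1) - P(t|t), symmetrized against rounding
gap = P_pred - P1
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()
```

The first form inverts a matrix of the size of the measurement vector, the second form matrices of the size of the state, so which one is cheaper depends on the relative dimensions of $m$ and $z$.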



4.3 Kalman Filter Variations 



A great many variations and forms of the Kalman filter have been developed to ad- 
dress different concerns or criticisms of the basic algorithm derived in Section 4.2. 
The most obvious concerns are the causality of the computed estimates, the numer- 
ical robustness or stability of the algorithm, the computational complexity, and the 
assumptions of Gaussianity and linearity. Each of these is dealt with in turn: 

Estimate Causality: The estimate $\hat{z}(t|t)$ is an estimate of $z$ at time $t$, based on
measurements up to time $t$. The causal nature of the estimates can clearly be seen
in Example 4.2. Because most dynamic processes are strongly correlated over
time, measurements at times $t+1, t+2, \ldots$ would be useful in estimating $z(t)$:

=> Section 4.3.3: Kalman Filter Smoothing
=> Section 4.3.3: Fixed-lag Smoothing

Numerical Stability: The numerical stability of the Kalman filter primarily
involves the robustness in the calculation of the estimation error covariances, in
particular whether the covariances remain positive-definite.³ In particular, the
covariance update (4.76)

$$P(t|t) = P(t|t-1) - K(t) \cdot C \cdot P(t|t-1) \qquad (4.81)$$

looks suspicious, because $P(t|t)$ must be symmetric, yet the right-hand side of
(4.81) looks asymmetric:

=> Section 4.3.1: Joseph Stabilized Form

In cases where the conditioning of the dynamic problem is poor, it may be
necessary to develop a different algorithm with improved conditioning:

=> Section 4.3.1: Square Root Kalman Filter

Computational and Storage Complexity: The regular Kalman filter has a
complexity of $O(n^3)$ per iteration, for a state of $n$ elements. For large
problems this complexity is prohibitive; for example, a 1000 x 1000 dynamic
image corresponds to $n = 10^6$. If a problem is stationary over time, then in steady
state the matrices $K(t)$, $P(t|t)$, $P(t|t-1)$ do not change over time and need to be
computed only once:

=> Section 4.3.2: Steady-State Kalman Filtering

In the event that the problem is too large to allow representation via large
covariances, then the problem needs to be transformed or represented differently:

=> Chapter 5: Modeling of Large Problems
=> Chapter 8: Changes and Reductions of Bases
=> Chapter 10: Kalman Filters for Large Problems

Model Linearity and Gaussianity: Many dynamic problems may be subject
to nonlinear dynamics, or may be observed in the presence of non-Gaussian noise.
If the dynamics are smooth, then repeated linearization may be appropriate:

=> Section 4.3.4: Extended Kalman Filter

Where a linearization may be inadequate, the statistics may be better preserved by
running a set of Kalman filters in parallel:

=> Section 4.3.4: Unscented Kalman Filter
=> Section 4.3.4: Ensemble Kalman Filter

Where the nonlinearities are severe, where the noise is highly non-Gaussian, or
where it is undesirable to calculate the problem statistics, we may want an entirely
implicit, nonparametric filter:

=> Section 4.3.4: Particle Filter

² Matrix inequalities are always interpreted in a positive-definite sense. Thus the inequality
$P(t|t) \le P(t|t-1)$ really means $P(t|t-1) - P(t|t) \ge 0$, that is, that the difference is
positive-semidefinite (see Appendix A.3).

³ In estimation, if a prior covariance is used which fails to be positive-definite, even to only
a tiny degree, it can lead to invalid estimates and negative estimation error variances.

4.3.1 Kalman Filter Algorithms

This section summarizes five of the most common implementation alternatives to the
basic Kalman filter. However, like the classic Kalman filter, none of these is
particularly well suited for huge estimation problems; specific Kalman filter variations for
the multidimensional case are discussed in Chapter 10.



Update-Update Form 

Some people consider it confusing, or unnecessarily complicated, to think of
separate prediction and update steps. It is straightforward to substitute the prediction
equations into the update ones in order to write $\hat{z}_u(t) = \hat{z}(t|t)$ directly in terms of
$\hat{z}_u(t-1) = \hat{z}(t-1|t-1)$:

$$\hat{z}_u(t) = (I - K(t)C(t))A(t-1)\hat{z}_u(t-1) + K(t)m(t) \qquad (4.82)$$
$$P_u(t) = (I - K(t)C(t))\left(A(t-1)P_u(t-1)A^T(t-1) + B(t-1)B^T(t-1)\right). \qquad (4.83)$$

A similar predict-predict form can also be derived.



Information Form 

The standard discrete-time Kalman filter is written in terms of estimation error
covariances $P(t|t)$, $P(t+1|t)$. It is possible, however, to formulate [8, 151] the Kalman
recursion in terms of matrix inverses $P^{-1}(t|t)$, $P^{-1}(t+1|t)$. There are three reasons
why a matrix-inverse representation might be desirable:

1. Matrix Sparsity:

As we show in Chapters 5 and 6, there is a wide variety of prior models and
constraints in which $P$ is dense, but $P^{-1}$ is sparse, implying that the inverse
covariance, or matrix information, may in many cases be a more natural and
efficient representation. Sparse methods in Kalman filtering are developed further
in Chapter 10.

2. Matrix Approximation:

Related to the preceding point on sparsity, for a given computational and storage
complexity we can usually approximate $P^{-1}$ more accurately than $P$.

3. Representation of Complete Uncertainty:

Given a variance $\sigma^2$:

    Perfect certainty         $\sigma^2 = 0$
    Complete uncertainty      $\sigma^2 = \infty$

Whereas representing in terms of an inverse variance:

    Perfect certainty         $\sigma^{-2} = \infty$
    Complete uncertainty      $\sigma^{-2} = 0$

The degrees of certainty anticipated in a given problem context may motivate one
form over the other, as numerical computation with $\infty$ may be inconvenient.

Suppose we let a state variable be defined as

$$\hat{z}_I(t|s) = P^{-1}(t|s)\,\hat{z}(t|s). \qquad (4.84)$$

In the inverse-covariance case, the update step becomes straightforward:

$$\text{From (3.67)} \qquad P^{-1}(t|t) = P^{-1}(t|t-1) + C^T(t)R^{-1}(t)C(t) \qquad (4.85)$$
$$\text{From (4.79)} \qquad \hat{z}_I(t|t) = \hat{z}_I(t|t-1) + C^T(t)R^{-1}(t)m(t), \qquad (4.86)$$

whereas the simple prediction step becomes much more complicated in the inverted
context. Applying the ABCD lemma (A.39) to (4.74) yields

$$P^{-1}(t|t-1) = A^{-T}P^{-1}(t-1|t-1)A^{-1} - A^{-T}P^{-1}(t-1|t-1)A^{-1}B \left[B^T A^{-T}P^{-1}(t-1|t-1)A^{-1}B + I\right]^{-1} B^T A^{-T}P^{-1}(t-1|t-1)A^{-1}. \qquad (4.87)$$

Finally, applying this result to (4.73) results in the state prediction

$$\hat{z}_I(t|t-1) = \left\{ I - A^{-T}P^{-1}(t-1|t-1)A^{-1}B \left[B^T A^{-T}P^{-1}(t-1|t-1)A^{-1}B + I\right]^{-1} B^T \right\} A^{-T}\,\hat{z}_I(t-1|t-1). \qquad (4.88)$$

In practice, we may choose to avoid the complexity of this latter step by performing
the prediction step in the usual way:

$$\hat{z}_I(t|t-1) = P^{-1}(t|t-1)\,\hat{z}(t|t-1)$$
$$\phantom{\hat{z}_I(t|t-1)} = P^{-1}(t|t-1)A(t-1)\,\hat{z}(t-1|t-1) \qquad (4.89)$$
$$\phantom{\hat{z}_I(t|t-1)} = P^{-1}(t|t-1)A(t-1)P(t-1|t-1)\,\hat{z}_I(t-1|t-1).$$
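The information-form update (4.85), (4.86) and the inverse-covariance prediction (4.87) can be checked against the ordinary covariance form. The sketch below assumes NumPy; the test matrices, the fixed invertible `A`, the seed, and the variable names `Y_*` (information matrix) and `y_*` (information state) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 3, 2

def random_spd(d):
    """Random symmetric positive-definite matrix (hypothetical helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

P_pred, R = random_spd(n), random_spd(k)
C = rng.standard_normal((k, n))
z_pred = rng.standard_normal(n)
m = rng.standard_normal(k)

# Covariance-form update, (4.75)-(4.77)
K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
z_upd = z_pred + K @ (m - C @ z_pred)
P_upd = P_pred - K @ C @ P_pred

# Information-form update, (4.85) and (4.86)
Y_pred = np.linalg.inv(P_pred)            # P^{-1}(t|t-1)
y_pred = Y_pred @ z_pred                  # information state at (t|t-1)
Y_upd = Y_pred + C.T @ np.linalg.inv(R) @ C
y_upd = y_pred + C.T @ np.linalg.inv(R) @ m

# Inverse-covariance prediction (4.87); A must be invertible
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 0.7]])
B = rng.standard_normal((n, 2))
A_inv = np.linalg.inv(A)
M = A_inv.T @ Y_upd @ A_inv               # A^{-T} P^{-1}(t|t) A^{-1}
Y_next = M - M @ B @ np.linalg.inv(B.T @ M @ B + np.eye(2)) @ B.T @ M
```

Recovering the state estimate requires one linear solve, `np.linalg.solve(Y_upd, y_upd)`, which is where the cost of the information form reappears whenever the estimate itself is needed at every step.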

Square Root Form 

The information form, just discussed, develops a Kalman filter based on covariance
inverses which may be of some interest in sparse contexts, but of limited significance
otherwise. Of much greater practical interest is the square root form [8, 151, 231],
which offers numerical robustness, making the Kalman filter applicable in
poorly-conditioned circumstances. As poor conditioning is common in large-state problems,
which we expect to encounter in image and multidimensional processing, the square
root form is particularly relevant to us.

The strength of the square root Kalman filter stems from the two following properties
of matrix square roots (Appendix A.8):

1. If a large covariance matrix $P$ is approximated in any way, $\hat{P} \approx P$, then it is
extremely likely that the approximate matrix $\hat{P}$ fails to be positive-definite,

$$P > 0 \quad \text{but} \quad \hat{P} \not> 0, \qquad (4.90)$$

whereas given $\Gamma$, a square root of $P$, and an approximate square root $\hat{\Gamma}$, then

$$\Gamma\Gamma^T \ge 0 \quad \text{and} \quad \hat{\Gamma}\hat{\Gamma}^T \ge 0. \qquad (4.91)$$

That is, an approximated square root guarantees positive-semidefiniteness.

2. Given $\Gamma$, a square root of $P$, then

$$P = \Gamma\Gamma^T \quad \Longrightarrow \quad \kappa(\Gamma) = \sqrt{\kappa(P)}, \qquad (4.92)$$

where $\kappa$ denotes the condition number. The number of floating-point digits required is
roughly $\log_{10}(\kappa)$, therefore the square root form requires only half the precision of
the standard Kalman filter, leading to significant savings in storage and computational
complexity (also see Appendix A.8).

Thus the square root Kalman filter offers substantial benefits of numerical robustness. 

One possible square root filter [333] follows from applying QR decompositions
(Appendix A.7.3). Suppose we begin with $\hat{z}(t|t-1)$, $\Gamma(t|t-1)$ where

$$P(t|t-1) = \Gamma(t|t-1)\Gamma^T(t|t-1), \qquad (4.93)$$

such that $\Gamma$ is the numerically robust representation of covariance $P$. Then we can
formulate square root update and prediction steps, as follows.

Square Root Update Step:

We begin with a QR decomposition

$$\underbrace{\begin{bmatrix} C(t)\Gamma(t|t-1) & R^{1/2}(t) \\ \Gamma(t|t-1) & 0 \end{bmatrix}}_{\text{Given} \ldots} \longrightarrow \underbrace{\begin{bmatrix} Q(t) & 0 \\ \bar{K}(t) & \Gamma(t|t) \end{bmatrix}}_{\text{Computed via QR}} \qquad (4.94)$$

which computes the updated uncertainty $\Gamma(t|t)$; the updated estimates are found as

$$\hat{z}(t|t) = \hat{z}(t|t-1) + \bar{K}(t)Q^{-1}(t)\,(m(t) - C(t)\hat{z}(t|t-1)). \qquad (4.95)$$

Square Root Predict Step:

The estimates are predicted as usual,

$$\hat{z}(t+1|t) = A(t)\hat{z}(t|t). \qquad (4.96)$$

A second QR decomposition computes the predicted uncertainty:

$$\underbrace{\begin{bmatrix} A(t)\Gamma(t|t) & B(t) \end{bmatrix}}_{\text{Given} \ldots} \longrightarrow \underbrace{\begin{bmatrix} \Gamma(t+1|t) & 0 \end{bmatrix}}_{\text{Computed via QR}} \qquad (4.97)$$

Why Does this Work?

The QR decomposition (Appendix A.7.3) orthogonally transforms a given matrix to
make it triangular. By squaring both sides the orthogonal matrix $U$ disappears:

$$AU = H \quad \Longrightarrow \quad AA^T = HH^T. \qquad (4.98)$$

Therefore each of (4.94) and (4.97) can be squared, and like terms equated, to
validate the result. We begin with the simpler predict step: squaring (4.97) yields

$$A(t)\Gamma(t|t)\Gamma^T(t|t)A^T(t) + B(t)B^T(t) = \Gamma(t+1|t)\Gamma^T(t+1|t) + 0, \qquad (4.99)$$

thus

$$A(t)P(t|t)A^T(t) + B(t)B^T(t) = P(t+1|t), \qquad (4.100)$$

which we recognize from the usual Kalman filter prediction step (4.74).
Next, squaring the more difficult (4.94) yields

$$\begin{bmatrix} C\Gamma(t|t-1)\Gamma^T(t|t-1)C^T + R & C\Gamma(t|t-1)\Gamma^T(t|t-1) \\ \Gamma(t|t-1)\Gamma^T(t|t-1)C^T & \Gamma(t|t-1)\Gamma^T(t|t-1) \end{bmatrix} = \begin{bmatrix} QQ^T & Q\bar{K}^T \\ \bar{K}Q^T & \bar{K}\bar{K}^T + \Gamma(t|t)\Gamma^T(t|t) \end{bmatrix}. \qquad (4.101)$$

Equating the top-left terms yields

$$QQ^T = CP(t|t-1)C^T + R. \qquad (4.102)$$

Then, selecting either of the off-diagonal terms yields

$$\bar{K} = P(t|t-1)C^T Q^{-T}. \qquad (4.103)$$

Finally, equating the lower-right terms then yields the desired result:

$$P(t|t-1) = P(t|t) + \bar{K}\bar{K}^T \qquad (4.104)$$
$$\phantom{P(t|t-1)} = P(t|t) + P(t|t-1)C^T Q^{-T}Q^{-1} CP(t|t-1) \qquad (4.105)$$
$$\phantom{P(t|t-1)} = P(t|t) + P(t|t-1)C^T(CP(t|t-1)C^T + R)^{-1}CP(t|t-1), \qquad (4.106)$$

which is the Kalman filter covariance update (4.76).
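The two array factorizations (4.94) and (4.97) can be sketched with a standard QR routine. In the illustration below, which assumes NumPy, the triangularization is obtained from the QR decomposition of the transposed array, the square root of $R$ is taken as its Cholesky factor, and the helper name `lower_triangularize` and all test matrices are hypothetical choices, not the implementation of [333].

```python
import numpy as np

def lower_triangularize(M):
    """Return a lower-triangular L with L @ L.T == M @ M.T (up to rounding).

    Since M.T = Q_orth @ R_upper, we have M @ M.T = R_upper.T @ R_upper.
    """
    R_upper = np.linalg.qr(M.T, mode="r")
    return R_upper.T

rng = np.random.default_rng(4)
n, k, p = 3, 2, 2
G_pred = rng.standard_normal((n, n))      # a square root of P(t|t-1)
P_pred = G_pred @ G_pred.T
C = rng.standard_normal((k, n))
W = rng.standard_normal((k, k))
R = W @ W.T + k * np.eye(k)
R_half = np.linalg.cholesky(R)            # R_half @ R_half.T == R

# Square root update step (4.94)
pre = np.block([[C @ G_pred, R_half],
                [G_pred, np.zeros((n, k))]])
post = lower_triangularize(pre)
Q = post[:k, :k]
K_bar = post[k:, :k]
G_upd = post[k:, k:]                      # square root of P(t|t)
gain = K_bar @ np.linalg.inv(Q)           # gain as used in (4.95)

# Square root predict step (4.97)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, p))
G_next = lower_triangularize(np.hstack([A @ G_upd, B]))

# Reference values from the standard filter
K_ref = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
P_upd_ref = P_pred - K_ref @ C @ P_pred
P_next_ref = A @ P_upd_ref @ A.T + B @ B.T
```

The factors returned by a QR routine are unique only up to signs, but the products `G_upd @ G_upd.T` and `K_bar @ inv(Q)` are unaffected, which is all that the derivation in (4.98)-(4.106) relies on.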

A great many variations on the square root filter are available [151], depending on 
exactly what sort of matrix factorization is undertaken, and whether the square root 
filter is in regular, predict-predict, update-update, or information form. 



Joseph Stabilized 

The most common form for the updated error covariance (4.69), (4.76) is

$$P(t|t) = (I - K(t)C(t))P(t|t-1). \qquad (4.107)$$

Given that $P(t|t)$ is a covariance matrix, by definition it must be symmetric and
positive-semidefinite. However, numerical rounding errors in the computation of
$K(t)$, or of the products in (4.107), can lead to the computed $P(t|t)$ being
asymmetric or containing negative eigenvalues.

An alternative, but algebraically equivalent, form of (4.107) is the Joseph stabilized
form [47]:

$$P(t|t) = (I - K(t)C(t))P(t|t-1)(I - K(t)C(t))^T + K(t)R(t)K^T(t). \qquad (4.108)$$

In general, because

$$P \ge 0 \quad \Longrightarrow \quad APA^T \ge 0 \quad \forall A, \qquad (4.109)$$

if $P(t|t-1)$ and $R(t)$ are symmetric, positive-semidefinite, then the Joseph form
(4.108) guarantees that $P(t|t)$ remains symmetric, positive-semidefinite, regardless
of rounding errors in $K$.

The improvement in numerical stability and conditioning comes at a cost of
additional matrix products in (4.108) relative to (4.107).
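The difference between (4.107) and (4.108) shows up as soon as the gain is slightly wrong. The sketch below, assuming NumPy and arbitrary test matrices, perturbs the optimal gain by a small random amount, standing in for accumulated rounding error, and compares the symmetry of the two results; the perturbation size and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 4, 2

def random_spd(d):
    """Random symmetric positive-definite matrix (hypothetical helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

P_pred, R = random_spd(n), random_spd(k)
C = rng.standard_normal((k, n))
I = np.eye(n)
K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)

def standard_update(gain):
    """Covariance update (4.107)."""
    return (I - gain @ C) @ P_pred

def joseph_update(gain):
    """Joseph stabilized covariance update (4.108)."""
    IKC = I - gain @ C
    return IKC @ P_pred @ IKC.T + gain @ R @ gain.T

# A slightly wrong gain, standing in for the effect of rounding errors
K_bad = K + 1e-3 * rng.standard_normal(K.shape)
P_std = standard_update(K_bad)
P_jos = joseph_update(K_bad)

asym_std = np.abs(P_std - P_std.T).max()
asym_jos = np.abs(P_jos - P_jos.T).max()
min_eig_jos = np.linalg.eigvalsh((P_jos + P_jos.T) / 2).min()
```

With the exact gain the two forms agree; with the perturbed gain the standard form is visibly asymmetric, whereas the Joseph form stays symmetric to rounding error and keeps nonnegative eigenvalues.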

Singular Filtering 

There may be some circumstances under which one or more measurements may be
known exactly, such that $R(t)$ is singular, meaning that $(CP(t|t-1)C^T + R)$ may
not be invertible, as required in the update step.

The general approach [8], then, is to divide the problem into two parts:

1. The part which is measured exactly, and therefore does not need to be estimated.

2. The part which is measured, as usual, in the presence of additive noise, which is
estimated using the regular Kalman filter.

In particular, suppose that the measurement model can be written as

$$m(t) = \begin{bmatrix} m_1(t) \\ m_2(t) \end{bmatrix} = \begin{bmatrix} C_1(t) \\ C_2(t) \end{bmatrix} z(t) + \begin{bmatrix} 0 \\ v_2(t) \end{bmatrix} \qquad v_2 \sim R_2(t). \qquad (4.110)$$

Without loss of generality, we assume $C_1$ to have full row rank,⁴ so that we can
similarly transform the state $z$ into two corresponding parts,

$$\begin{bmatrix} z_1(t) \\ z_2(t) \end{bmatrix} = Dz(t) = \begin{bmatrix} C_1(t) \\ D_2 \end{bmatrix} z(t). \qquad (4.111)$$

$D_2$ is any matrix which makes $D$ invertible, guaranteeing that $z_2$ spans the entire
portion of the state not exactly measured by $m_1$. The first part of the state is known
exactly,

$$z_1(t) = m_1(t) \qquad (4.112)$$

and for the second part an estimation problem can be defined:

$$m_2(t) = C_2(t)z(t) + v_2(t) = C_2(t)\begin{bmatrix} C_1(t) \\ D_2 \end{bmatrix}^{-1}\begin{bmatrix} m_1(t) \\ z_2(t) \end{bmatrix} + v_2(t) \qquad (4.113)$$
$$\mathrm{cov}(z_2(t)) = D_2 P(t|t-1) D_2^T, \qquad (4.114)$$

from which we estimate $\hat{z}_2(t)$ with estimation error covariance $P_2(t)$. The estimates
for the partitioned state allow us to reconstruct the original:

$$\hat{z}(t|t) = \begin{bmatrix} C_1(t) \\ D_2 \end{bmatrix}^{-1}\begin{bmatrix} m_1(t) \\ \hat{z}_2(t) \end{bmatrix} \qquad P(t|t) = \begin{bmatrix} C_1(t) \\ D_2 \end{bmatrix}^{-1}\begin{bmatrix} 0 & 0 \\ 0 & P_2(t) \end{bmatrix}\begin{bmatrix} C_1(t) \\ D_2 \end{bmatrix}^{-T} \qquad (4.115)$$



4.3.2 Steady-State Kalman Filtering 



The case of linear time-invariant (LTI) systems is of special interest. Although such 
systems may occur infrequently in practice, their time-stationarity leads to attractive 
simplifying properties. 

⁴ If there are redundant exact measurements, the redundant ones can be removed until full
rank is achieved.

Suppose we are given the usual first-order Gauss-Markov model, as in (4.1), but now
time-invariant:

$$z(t+1) = Az(t) + Bw(t) \qquad m(t) = Cz(t) + v(t) \qquad v \sim R, \qquad (4.116)$$

where time-invariance implies that model parameters A,B,C,R are fixed over time. 

If the dynamic system is stable, meaning that the eigenvalues $\lambda_i$ of $A$ satisfy
$|\lambda_i| < 1$, then the statistics of $z$ will reach steady-state (that is, will be invariant
over time). In steady-state the covariance $P(t)$ of $z$ will converge to the solution of the
algebraic Lyapunov equation [200]

$$P_\infty = AP_\infty A^T + BB^T. \qquad (4.117)$$

The Kalman filter itself can also reach steady-state, meaning that the Kalman gain
$K(t)$ and the estimation error covariances $P(t|t-1)$, $P(t|t)$ converge to fixed values
as $t \to \infty$. In particular, under fairly general conditions (system reachability and
observability⁵), the predicted error covariance will converge to the unique,
positive-definite solution to the algebraic Riccati equation [200]

$$P_p = AP_pA^T + BB^T - AP_pC^T(CP_pC^T + R)^{-1}CP_pA^T, \qquad (4.118)$$

from which the constant Kalman gain follows as

$$K = P_pC^T(CP_pC^T + R)^{-1}. \qquad (4.119)$$

Finally, it is instructive to see how the predicted estimation error evolves; given

$$z(t+1) = Az(t) + Bw(t) \qquad (4.120)$$
$$\hat{z}(t+1|t) = A\hat{z}(t|t) = A\left(K(m(t) - C\hat{z}(t|t-1)) + \hat{z}(t|t-1)\right) \qquad (4.121)$$

then the predicted estimation error is

$$\tilde{z}(t+1|t) = \hat{z}(t+1|t) - z(t+1) \qquad (4.122)$$
$$\phantom{\tilde{z}(t+1|t)} = A\left(K(m(t) - C\hat{z}(t|t-1)) + \hat{z}(t|t-1)\right) - \left(Az(t) + Bw(t)\right) \qquad (4.123)$$
$$\phantom{\tilde{z}(t+1|t)} = A(I - KC)\tilde{z}(t|t-1) + AKv(t) - Bw(t). \qquad (4.124)$$

That is, the predicted estimation error itself obeys a dynamic relationship which is
stable when $A(I - KC)$ is stable, a stability not dependent on the stability of $A$.
Because the Kalman gain matrix $K$ is essentially the regularized solution to the inverse
problem $m = Cz$, we expect that $KC \approx I$, and so $(I - KC) \approx 0$.

⁵ Meaning that, over time, each element in $z$ is affected by the driving process $w$, and that the
measurements feel the influence of each element of $z$ to some degree; in other words, that
no portion of the state be completely decoupled from $w$ and $m$. Specifically, it is required
that $[B \;\; AB \;\; AAB \;\; \ldots]$ and $[C^T \;\; A^TC^T \;\; A^TA^TC^T \;\; \ldots]$ be full rank.

Given the steady-state gain matrix $K$, the Kalman filter reduces to

$$\hat{z}(t|t-1) = A\hat{z}(t-1|t-1) \qquad (4.125)$$
$$\hat{z}(t|t) = \hat{z}(t|t-1) + K(m(t) - C\hat{z}(t|t-1)), \qquad (4.126)$$

involving only fast matrix-vector operations, no matrix multiplications or inverses,
a huge reduction in complexity to $O(n^2)$ per iteration from $O(n^3)$ per iteration for
the full filter. Keep in mind the following limitations:

1. Steady-state filtering applies only to cases in which the underlying dynamics are 
stationary, and in which the measurement model is stationary (constant model 
C and quality R). In particular, cases of irregular measurements (e.g., sparse 
measurements of an image) do not satisfy stationarity. 

2. The steady-state statistics and gain matrix can be difficult to compute; in
particular, the Riccati equation (4.118) can be very challenging computationally. In
practice, it is often much simpler to run the regular Kalman filter for $t$ steps,
where $t$ is chosen "large enough" such that $K(t) \approx K$, $P(t|t-1) \approx P_p$.
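The practical recipe in the second limitation, running the covariance recursion until it settles, can be sketched as follows. The example assumes NumPy and a small illustrative model (a two-state system whose `A` has both eigenvalues equal to one, so `A` itself is not stable, but which is reachable and observable); the iteration count of 500 is an arbitrary choice of "large enough".

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # eigenvalues on the unit circle
B = np.array([[0.5],
              [1.0]])
C = np.array([[1.0, 0.0]])
R = np.array([[1.0]])

def riccati_step(P):
    """One application of the right-hand side of (4.118)."""
    S = C @ P @ C.T + R
    return A @ P @ A.T + B @ B.T - A @ P @ C.T @ np.linalg.solve(S, C @ P @ A.T)

# Iterate the predicted error covariance until it stops changing
P_p = np.eye(2)
for _ in range(500):
    P_p = riccati_step(P_p)

K = P_p @ C.T @ np.linalg.inv(C @ P_p @ C.T + R)      # steady-state gain (4.119)
residual = np.abs(riccati_step(P_p) - P_p).max()      # how well (4.118) is satisfied
rho = np.abs(np.linalg.eigvals(A @ (np.eye(2) - K @ C))).max()
```

Here `rho`, the spectral radius of $A(I - KC)$, comes out below one even though `A` is not stable, in line with the remark following (4.124). For large problems a dedicated Riccati solver would normally be preferred over plain iteration.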



4.3.3 Kalman Filter Smoother 

The Kalman filter is strictly causal, in the sense that an estimate $\hat{z}(t|t)$ is affected by
measurements only at time $t$ and earlier. However, even for causal dynamic
processes, measurements from the future can be extremely useful in estimating the
present.

For example, consider the simple first-order causal dynamic model

$$z(t+1) = \alpha z(t) + w(t). \qquad (4.127)$$

The autocorrelation of $z$ is easily computed as

$$E[z(t+s)z(t)] = E[(\alpha z(t+s-1) + w(t+s-1))z(t)] = \alpha^s E[z(t)z(t)] \qquad (4.128)$$
$$E[z(t-s)z(t)] = E[z(t-s)(\alpha z(t-1) + w(t-1))] = \alpha^s E[z(t-s)z(t-s)], \qquad (4.129)$$

where $s > 0$. If $z$ is in statistical steady-state, meaning that $E[z(t)z(t)] = P$ is
constant over time, then the autocorrelation of causal process $z(t)$ is perfectly symmetric
over time,

$$E[z(t_1)z(t_2)] = \alpha^{|t_1 - t_2|} P, \qquad (4.130)$$

implying that measurements of $z(t-1)$, $z(t+1)$ are both equally useful in
estimating $z(t)$. Although this example examined a simple, scalar, stationary case, the
conclusion applies generally: a noncausal estimator is likely to yield more accurate
results than a causal one. Indeed, as the process noise gets smaller (smaller $B$) and
as the measurements become more infrequent, the greater is the possible benefit of
future measurements in reducing the estimation error in the present.

Fig. 4.2. The Kalman smoother is a two-pass algorithm: first the regular Kalman filter
forwards over time, and then a backwards iteration. All of the estimated states and error
covariances from the forward pass must be saved for the backward pass, meaning that the
storage requirements are greatly increased over those of the standard Kalman filter alone.



The idea is sketched in Figure 4.2: given the results $\hat{z}(t|t), P(t|t)$, $0 \le t \le \tau$ of the Kalman filter, is it possible to find an algorithm, ideally iterative, which works backwards, finding $\hat{z}(t|\tau), P(t|\tau)$?

Indeed, such an algorithm, the Kalman smoother, was developed by Rauch, Tung, and Striebel [266]. Recall the innovations process $\nu(t)$ from (4.52),

$$ \nu(t) = m(t) - \hat{m}(t|t-1). \tag{4.131} $$

This process is white, but contains all of the information present in the measurements. Then the smoothed estimate can be written [333] as the sum of the predicted estimate, $\hat{z}(t|t-1)$, and the estimate of the prediction error $\tilde{z}(t|t-1)$ based on future measurements:



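$$ \hat{z}(t|\tau) = \sum_{s=0}^{\tau} K(t,s)\,\nu(s) = \hat{z}(t|t-1) + \sum_{s=t}^{\tau} K(t,s)\,\nu(s) \tag{4.132} $$

$$ = \hat{z}(t|t-1) + \sum_{s=t}^{\tau} E\big[\tilde{z}(t|t-1)\,\nu^T(s)\big]\, E\big[\nu(s)\,\nu^T(s)\big]^{-1} \nu(s) \tag{4.133} $$

$$ = \hat{z}(t|t-1) + \sum_{s=t}^{\tau} E\big[\tilde{z}(t|t-1)\,\tilde{z}^T(s|s-1)\big]\, C^T(s)\, \big(C(s)P(s|s-1)C^T(s) + R(s)\big)^{-1} \nu(s) \tag{4.134} $$

$$ = \hat{z}(t|t-1) + P(t|t-1) \sum_{s=t}^{\tau} F^T(t) \cdots F^T(s-1)\, C^T(s)\, \big(C(s)P(s|s-1)C^T(s) + R(s)\big)^{-1} \nu(s), \tag{4.135} $$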



where 

$$ F(t) = A(t)\,\big(I - K(t)C(t)\big) \tag{4.136} $$

is the dynamic matrix, as in (4.124), governing the evolution of the prediction errors. 
By inspection, the summation term in (4.135) admits a recursive form, leading to the backwards recursion for the Kalman smoother:

$$ \hat{z}(t|\tau) = \hat{z}(t|t-1) + P(t|t-1)\,u(t|\tau) \tag{4.137} $$

where

$$ u(t-1|\tau) = F^T(t-1)\,u(t|\tau) + C^T(t-1)\,\big(C(t-1)P(t-1|t-2)C^T(t-1) + R(t-1)\big)^{-1} \nu(t-1) \tag{4.138} $$

with the recursion initialized as $u(\tau+1|\tau) = 0$.

Although this recursion was relatively straightforward to derive, there are simpler 
and more elegant alternatives [151]: 

$$ \hat{z}(t|\tau) = \hat{z}(t|t) + Q(t)\,\big(\hat{z}(t+1|\tau) - \hat{z}(t+1|t)\big) \tag{4.139} $$

$$ P(t|\tau) = P(t|t) + Q(t)\,\big(P(t+1|\tau) - P(t+1|t)\big)\,Q^T(t), \tag{4.140} $$

where

$$ Q(t) = P(t|t)\,A^T(t)\,P^{-1}(t+1|t). \tag{4.141} $$

The application of these latter smoother equations to one-dimensional first- and 
second-order systems is illustrated in Example 4.3. 
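
As an illustration, the two-pass structure of (4.139)-(4.141) can be sketched for a scalar model. This is only a minimal sketch under stated assumptions: the model is time-invariant, z(t+1) = a z(t) + b w(t) and m(t) = c z(t) + v(t) with measurement noise variance r, a missing measurement is encoded as `None`, and the function names are hypothetical rather than taken from the text.

```python
def kalman_forward(ms, a, b, c, r, z0, p0):
    """Forward Kalman pass; stores predicted (t|t-1) and filtered (t|t) values."""
    zp, pp, zf, pf = [], [], [], []
    z_pred, p_pred = z0, p0
    for m in ms:
        zp.append(z_pred)
        pp.append(p_pred)
        if m is None:                       # no measurement: update has no effect
            z_upd, p_upd = z_pred, p_pred
        else:
            k = p_pred * c / (c * p_pred * c + r)
            z_upd = z_pred + k * (m - c * z_pred)
            p_upd = p_pred - k * c * p_pred
        zf.append(z_upd)
        pf.append(p_upd)
        z_pred, p_pred = a * z_upd, a * p_upd * a + b * b
    return zp, pp, zf, pf


def rts_smooth(ms, a, b, c, r, z0, p0):
    """Backward pass of (4.139)-(4.141), scalar case."""
    zp, pp, zf, pf = kalman_forward(ms, a, b, c, r, z0, p0)
    zs, ps = zf[:], pf[:]                   # at the final time, smoothed == filtered
    for t in range(len(ms) - 2, -1, -1):
        q = pf[t] * a / pp[t + 1]                          # (4.141)
        zs[t] = zf[t] + q * (zs[t + 1] - zp[t + 1])        # (4.139)
        ps[t] = pf[t] + q * (ps[t + 1] - pp[t + 1]) * q    # (4.140)
    return zf, pf, zs, ps
```

For a random walk with a single late measurement, for example `rts_smooth([None, None, 1.0], 1.0, 1.0, 1.0, 0.1, 0.0, 1.0)`, the filtered estimates before the measurement stay at the prior mean, whereas the smoothed estimates are pulled towards the later measurement. Note that the whole forward sequence is kept in lists, which is exactly the storage burden discussed in the text.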

Although it is clear that the smoother can provide improvements in estimation accu- 
racy by using future measurements, it is important to understand its limitations: 

• The Kalman filter is recursive, requiring no past history to be stored. However, because the smoother requires the entire forward sequence $\hat{z}(t|t-1), P(t|t-1)$ to be stored, the storage complexity of the smoother is $\tau$ times greater than that of the Kalman filter.

In those cases where z is a long vector, such as in image estimation, this storage burden may be significant. This issue will be examined further in Chapter 10, and the reader may be interested in looking at Example 10.1 on page 332, where the Kalman smoother is applied to an image.

• The delay between receiving measurement $m(0)$ and producing the smoothed estimate $\hat{z}(0|\tau)$ clearly grows with $\tau$. However (see Problem 4.3), although $P(0|\tau)$ is a decreasing function of $\tau$, in most circumstances the benefit is limited to small values of $\tau$; larger $\tau$ increases delay and storage requirements with almost no additional benefit in accuracy.

In those cases where a very modest smoothing window is adequate, it is straightforward to implement a recursive fixed-lag smoother directly in the Kalman filter. Indeed, we just define an augmented state, as in (4.8):

Example 4.3: Recursive Smoothing and Interpolation



Example 4.2 on page 98 looked at the application of the Kalman filter to interpolation. Recall that the estimates $\hat{z}(t|t)$ were conspicuously causal:

Measurement $m(s)$ influences $\hat{z}(t|t)$ only for $t \ge s$.

However, the causality of the Kalman filter does not imply a causality in the underlying statistics: $m(s)$ does tell us a great deal about $z(t)$ for $t < s$. That is, we expect quite different results between $\hat{z}(t|t)$ from the Kalman filter and $\hat{z}(t|\tau)$ from the Kalman smoother.

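We will consider two models:

I. First Order: $z(t) = z(t-1) + w(t-1)$, so that $z$ is a random walk.

II. Second Order: $z(t) = z(t-1) + \Delta(t-1)$, $\Delta(t) = \Delta(t-1) + w(t-1)$, in which case the slope $\Delta$ of $z$ is a random walk,

where we can rewrite the second-order model in a single state as

$$ \begin{bmatrix} z(t) \\ \Delta(t) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} z(t-1) \\ \Delta(t-1) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} w(t-1). \tag{4.142} $$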



The following two results are for the first-order case, where the difference be- 
tween the causal estimates of the Kalman filter (dashed) and the Kalman smoother 
(solid) can be clearly seen: 



[Figure: first-order case, two panels plotted over time, with the measurement times t_1, t_2, t_3 marked.]

Two further results for the second-order case:

[Figure: second-order case, two panels showing the estimates $\hat{z}(t|t)$ and the error variances $P(t|t)$.]

Observations:




• Although the dynamic model for $z(t)$ is written causally, the smoothed estimates $\hat{z}(t|\tau)$ are acausally dependent on the measurements.

• In all cases, the smoother can only improve our certainty,

$$ P(t|\tau) \le P(t|t) $$

since the measurements available to the smoother include all of those available to the filter, plus some more.

• The smoothed estimates are, in most cases, more appealing, aesthetic, and 
desirable than the causal ones. The difference is particularly striking in the 
second-order case. 



$$ \begin{bmatrix} z(t) \\ z(t-1) \\ \vdots \\ z(t-r) \end{bmatrix} = \begin{bmatrix} A(t-1) & 0 & \cdots & 0 \\ I & 0 & \cdots & 0 \\ & \ddots & & \vdots \\ 0 & & I & 0 \end{bmatrix} \begin{bmatrix} z(t-1) \\ z(t-2) \\ \vdots \\ z(t-r-1) \end{bmatrix} + \begin{bmatrix} B(t-1) \\ 0 \\ \vdots \\ 0 \end{bmatrix} w(t-1) \tag{4.143} $$

such that the resulting estimates

$$ \begin{bmatrix} \hat{z}(t|t) \\ \hat{z}(t-1|t) \\ \vdots \\ \hat{z}(t-r|t) \end{bmatrix} \tag{4.144} $$

include the desired, fixed-lag smoothed estimate $\hat{z}(t-r|t)$. Although the augmented dynamic matrix in (4.143) is sparse, the computed error covariances $P(t|t), P(t|t-1)$ will be dense. Therefore with a storage complexity of $O(r^2)$ and a computational complexity of $O(r^3)$ the fixed-lag approach is feasible for only relatively modest $r$.

Further details on the smoother, including square root and information forms, can be found in [8, 28, 125, 151]. Multidimensional Kalman smoothers are discussed in Chapter 10.
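
The state augmentation behind the fixed-lag smoother can be sketched as follows. This is a hedged illustration only: it assumes time-invariant A, B, C, uses NumPy, and the helper name `augment_fixed_lag` is hypothetical. The augmented matrices are then passed to an ordinary Kalman filter.

```python
import numpy as np


def augment_fixed_lag(A, B, C, lag):
    """Build augmented (A, B, C) whose state stacks z(t), z(t-1), ..., z(t-lag)."""
    n = A.shape[0]
    N = n * (lag + 1)
    A_aug = np.zeros((N, N))
    A_aug[:n, :n] = A                     # newest block obeys the original dynamics
    A_aug[n:, :N - n] = np.eye(n * lag)   # older blocks are shifted copies
    B_aug = np.vstack([B, np.zeros((n * lag, B.shape[1]))])   # noise drives z(t) only
    C_aug = np.hstack([C, np.zeros((C.shape[0], n * lag))])   # only z(t) is measured
    return A_aug, B_aug, C_aug
```

The augmented state has dimension n(lag + 1), so the dense error covariance has n²(lag + 1)² entries, which is the quadratic storage growth noted in the text.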



4.3.4 Nonlinear Kalman Filtering 

In this text we focus on linear estimation problems, so the following discussion on 
nonlinear filters represents a bit of a digression, included because the methods are 
fairly intuitive and nicely complement the Kalman filter discussion of the rest of this 
chapter. 



Extended Kalman Filter 

The extended Kalman filter [151, 156] allows for the system dynamics and measure- 
ment functions to be nonlinear functions of the current state. However, because non- 
linear transformations of random vectors (Appendix B.2) are very complicated, the 
dynamics and measurement functions are linearized at each time step in computing 
the predicted and updated error covariances. 

So, given possibly nonlinear dynamic and observation models 

$$ z(t+1) = A(z(t)) + Bw(t) \qquad m(t) = C(z(t)) + v(t) \tag{4.145} $$

the Kalman filter equations of (4.73)-(4.77) may be rewritten as 

$$ \text{Prediction:} \quad \hat{z}(t|t-1) = A(\hat{z}(t-1|t-1)) \tag{4.146} $$

$$ P(t|t-1) = A_{t-1}\,P(t-1|t-1)\,A_{t-1}^T + BB^T \tag{4.147} $$

$$ \text{Update:} \quad \hat{z}(t|t) = \hat{z}(t|t-1) + K(t)\,\big(m(t) - C(\hat{z}(t|t-1))\big) \tag{4.148} $$

$$ P(t|t) = P(t|t-1) - K(t)\,C_t\,P(t|t-1) \tag{4.149} $$

$$ K(t) = P(t|t-1)\,C_t^T\,\big(C_t P(t|t-1) C_t^T + R\big)^{-1}, \tag{4.150} $$

where $A_t$, $C_t$ are the linearizations

$$ A_t = \left.\frac{\partial A}{\partial z}\right|_{\hat{z}(t|t)} \qquad C_t = \left.\frac{\partial C}{\partial z}\right|_{\hat{z}(t|t-1)}. \tag{4.151} $$



The implementation is thus a straightforward extension of the standard Kalman filter. 
As only the estimates themselves, and not the associated statistics, are propagated via 
the nonlinear dynamic and measurement functions, the performance of the extended 
Kalman filter can be quite poor, especially for rapidly-varying or discontinuous non- 
linearities. 
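
One predict/update cycle of (4.146)-(4.150) can be sketched for a scalar state. The sketch assumes that the caller supplies the nonlinear functions `f`, `h` and their derivatives `df`, `dh` (all hypothetical names), and that `q` stands for the scalar value of $BB^T$.

```python
def ekf_step(z_est, p_est, m, f, df, h, dh, q, r):
    """One extended Kalman filter iteration for a scalar state."""
    a_lin = df(z_est)                       # A_{t-1}: dynamics linearized at z(t-1|t-1)
    z_pred = f(z_est)                       # (4.146): estimate goes through the nonlinearity
    p_pred = a_lin * p_est * a_lin + q      # (4.147): covariance uses the linearization
    c_lin = dh(z_pred)                      # C_t: measurement linearized at z(t|t-1)
    k = p_pred * c_lin / (c_lin * p_pred * c_lin + r)    # (4.150)
    z_upd = z_pred + k * (m - h(z_pred))                 # (4.148)
    p_upd = p_pred - k * c_lin * p_pred                  # (4.149)
    return z_upd, p_upd
```

With linear `f` and `h` the step reduces to the ordinary Kalman filter, which is a convenient sanity check. Only the estimate passes through `f` and `h`; the variance sees just the local slopes, which is why the method can do poorly for sharp nonlinearities.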



Ensemble Kalman Filter 

A key limitation of the extended Kalman filter is its inability to propagate the statistics through the model nonlinearities. Although such nonlinear propagation is difficult to do analytically (see Section B.2), it is easy to do so numerically by computing sample statistics. Even for a linear problem, inferring the statistics from samples frees us from having to predict and update the estimation error covariances, which is convenient for dynamic problems having a large state.

Such a covariance propagation by sampling is the essence of the ensemble Kalman 
filter [100, 101, 164]. An ensemble of estimates is predicted at each time step, and 
then updated on the basis of measurements with added random noise. 



We begin by initializing random samples from the prior distribution 

$$ \hat{z}_i(0|-1) \sim \mathcal{N}(z(0), P(0)). \tag{4.152} $$

Each of the random samples is essentially iterated as a Kalman filter, all in parallel. 
Each sample is passed through the (possibly) nonlinear prediction 

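$$ \hat{z}_i(t|t-1) = A(\hat{z}_i(t-1|t-1)) \tag{4.153} $$

from which the estimate and error covariance can be computed as sample statistics:

$$ \hat{z}(t|t-1) = \text{Sample Mean}\big(\{\hat{z}_i(t|t-1)\}\big), \tag{4.154} $$

$$ P(t|t-1) = \text{Sample Covariance}\big(\{\hat{z}_i(t|t-1)\}\big). \tag{4.155} $$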

For the update step, random noise $v_i(t)$ is added to the measurements in order to prevent the ensemble members from converging:

$$ \hat{z}_i(t|t) = \hat{z}_i(t|t-1) + K(t)\,\big((m(t) + v_i(t)) - C\hat{z}_i(t|t-1)\big) \tag{4.156} $$

$$ K(t) = P(t|t-1)\,C^T\,\big(CP(t|t-1)C^T + R\big)^{-1}, \tag{4.157} $$

where we observe the absence of the equation to compute $P(t|t)$, as this is no longer needed.

Fig. 4.3. The ensemble (left) and unscented (right) Kalman filters take very different approaches in the use of ensembles to handle nonlinearities. The ensemble KF has a fixed set of filters, run in parallel, whereas the unscented KF computes an explicit estimate $\hat{z}$ at every step and resamples after every predict and update step.

The drawback of the ensemble Kalman filter is that the number of ensemble members may need to be relatively large in order to obtain reasonable sample statistics of the estimation mean and covariance. As with the regular Kalman filter, a number of variations exist, such as a square root ensemble Kalman filter [311].
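
A scalar sketch of one ensemble iteration, following (4.153)-(4.157), is given below. It is only an illustration under stated assumptions: the state is scalar, the measurement model is linear with gain `c` and noise variance `r`, and the function name and the use of Python's `random` and `statistics` modules are choices made here, not part of the text.

```python
import random
import statistics


def enkf_step(ensemble, m, f, c, r, rng):
    """One ensemble Kalman filter iteration for a scalar state."""
    pred = [f(z) for z in ensemble]             # (4.153): each member predicted separately
    p_pred = statistics.variance(pred)          # (4.155): sample covariance of the ensemble
    k = p_pred * c / (c * p_pred * c + r)       # (4.157)
    # (4.156): each member is updated with its own perturbed measurement
    return [z + k * ((m + rng.gauss(0.0, r ** 0.5)) - c * z) for z in pred]
```

The estimate at any time is the mean of the returned ensemble, as in (4.154); no error covariance is propagated explicitly. For instance, starting from 500 members drawn with `random.Random(0).gauss(0, 2)` and calling `enkf_step(members, 3.0, lambda z: 0.9 * z, 1.0, 0.5, rng)` gives an ensemble concentrated near the measurement.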



Unscented Kalman Filter 

The unscented Kalman filter [181, 325, 326] is closely related to the ensemble Kalman filter, above, with two key differences:



1. In the ensemble Kalman filter we essentially had N individual Kalman filters, running in parallel, and not interacting with each other except in the computation of sample means and covariances.

In contrast, the unscented Kalman filter is essentially a single filter, with a single predicted and estimated value per time step, which resamples every predict and update step, as illustrated in Figure 4.3.

2. Whereas the ensemble filter is a Monte Carlo method, requiring a substantial number of ensemble members to properly reproduce the underlying statistics, the unscented filter uses a small number of deterministic samples. Moreover, rather than the usual sample mean and sample covariance of (4.154), (4.155), a weighted version of these equations is used.

The propagation of statistics through nonlinearities is considerably improved by this method over the extended Kalman filter, and with many fewer ensemble members than the ensemble Kalman filter. However, we still have a second-order representation (mean and covariance) at each time step, a representation which may be poor in the case of highly asymmetric distributions stemming from sharp nonlinearities.
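
The idea of a few deterministic, weighted samples can be sketched for a scalar random variable. The text does not specify the sample locations or weights, so the choices below are an assumption: three symmetric sigma points with a spread parameter `kappa`, one common form of the unscented transform.

```python
import math


def unscented_transform(mean, var, g, kappa=2.0):
    """Propagate a scalar mean and variance through g using three weighted samples."""
    n = 1                                            # state dimension
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]    # deterministic samples
    weights = [kappa / (n + kappa)] + [1.0 / (2.0 * (n + kappa))] * 2
    ys = [g(x) for x in points]
    y_mean = sum(w * y for w, y in zip(weights, ys))                  # weighted sample mean
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))   # weighted sample variance
    return y_mean, y_var
```

For a linear function the transform is exact. For `g = lambda x: x * x` with zero mean and unit variance it returns a mean of 1 and a variance of 2, matching the Gaussian case, whereas linearizing about the mean would give a slope of zero and hence zero output variance.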



Particle Filters 

Particle filters [170, 256] are a natural continuation of the ensemble and unscented approaches. However, the topic is a large field, with entire textbooks [88, 272] dedicated to it, so we can provide only a superficial treatment here.

The extended, ensemble, and unscented filters all suffer from the explicit compu- 
tation of and reliance on error covariances, essentially implicitly assuming a Gaus- 
sian distribution. Since most nonlinear problems will have asymmetric non-Gaussian 
statistics, the covariance representation is known to be poor. 

The key to the particle filter is that the distribution of the estimated state is repre- 
sented implicitly, rather than explicitly; we do not explicitly calculate an estimate 
mean or covariance. 

Given a set of estimates $\{\hat{z}_i(t-1|t-1)\}$ at time $t-1$, each estimate is imagined as a particle (or hypothesis) in n-dimensional space. One iteration of the particle filter proceeds as follows:

• Prediction: Apply the temporal dynamics, which could be linear or nonlinear, and deterministic or stochastic. Each particle is predicted separately, giving us the predicted set

$$ \hat{z}_i(t|t-1) = \text{Predict}\big(\hat{z}_i(t-1|t-1)\big). \tag{4.158} $$

• Update: We assert some metric which measures the quality of each particle, as assessed by the measurements. Essentially, we want to infer the likelihood (or some equivalent metric) of each state particle

$$ p_{t,i} = \Pr\big(\hat{z}_i(t|t-1) \mid m(t)\big). \tag{4.159} $$

• Resampling: We need to steer the ensemble set in favour of likely candidates; that is, we wish to probabilistically favour those states which are consistent with the measurements, while with some probability also preserving less likely states. The biasing towards the measurements ensures accurate estimation, with most states highly consistent with the observations, whereas the preservation of less likely states adds robustness, for example in the case where the measurement is a spurious outlier, or in cases of occlusion (where part of the state is hidden).

In principle, we could nonparametrically estimate the distribution $p(z(t)|m(t))$ from the likelihoods $\{p_{t,i}\}$, and then randomly resample particles from this distribution.

At every stage of the filter, the ensemble of particles implicitly, nonparametrically 
describes the probability density p(z), clearly not relying on a single mean and co- 
variance. 
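
The three steps above can be sketched as a single iteration for scalar particles. This is a minimal sketch under assumptions: resampling is done by drawing particles with probability proportional to their weights (one simple choice among many), and `predict` and `likelihood` are hypothetical user-supplied functions.

```python
import random


def particle_step(particles, m, predict, likelihood, rng):
    """One predict / weight / resample iteration of a basic particle filter."""
    pred = [predict(z, rng) for z in particles]      # (4.158): each particle separately
    weights = [likelihood(m, z) for z in pred]       # (4.159): unnormalized quality metric
    if sum(weights) == 0.0:                          # no particle explains m: keep the set
        return pred
    # Resampling: likely particles are duplicated, unlikely ones survive only rarely
    return rng.choices(pred, weights=weights, k=len(pred))
```

No mean or covariance is computed anywhere; the returned list of particles is itself the representation of the distribution, and any summary statistic is derived from it only if needed.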



4.4 Dynamic Sampling 

Section 3.3 examined random prior or posterior sampling in the static context. Now 
suppose that we have a dynamic process z(t), for which we would like to generate 
random samples. 

We continue to assume that the dynamic process z(t) obeys the first-order Gauss- 
Markov model of (4.1): 

$$ z(t+1) = A(t)z(t) + B(t)w(t), \qquad z(0) \sim \mathcal{N}(z_0, P_0), \tag{4.160} $$

where $w(t)$ is a white Gaussian noise process with identity covariance. The first-order Gauss-Markovianity is crucial, because it allows the statistics of the next time step to be expressed in terms of the current state:

$$ z(t+1) \mid z(t) \sim \mathcal{N}\big(A(t)z(t),\, B(t)B^T(t)\big). \tag{4.161} $$

Furthermore, the matrix $B(t)$ is already the square root of $B(t)B^T(t)$, the covariance of the added noise. Therefore prior sampling proceeds recursively, such that

$$ z(t+1) = A(t)z(t) + B(t)w(t) \tag{4.162} $$

and only the initial covariance matrix $P_0 = \Gamma_0\Gamma_0^T$ needs to be decomposed per (2.76), where we initialize the recursion with

$$ z(0) = z_0 + \Gamma_0 w. \tag{4.163} $$

However, it should be noted that in a derived dynamic model the parameters $B(t)$ are not, in fact, normally known; rather the inferred dynamic process takes the form

$$ z(t+1) = A(t)z(t) + w(t), \qquad z(0) \sim \mathcal{N}(z_0, P_0), \qquad w(t) \sim \mathcal{N}(0, W(t)) \tag{4.164} $$

where $w(t)$ is a noise process with covariance $W(t)$. In this case we must decompose $P_0 = \Gamma_0\Gamma_0^T$ as before, but also $W(t) = \Gamma_t\Gamma_t^T$ for each time step, a great deal more work than before. In this case the key to efficiency is somehow to keep the size of matrix $W(t)$, and thus the state dimension of $z(t)$, as small as possible.
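
The recursion (4.162), (4.163) can be sketched as below. The sketch assumes time-invariant A and B, uses a Cholesky factor as the square root of $P_0$ (one valid choice of decomposition), and the function name is hypothetical.

```python
import numpy as np


def sample_prior_path(A, B, z0, P0, steps, rng):
    """Draw one random sample path z(0), ..., z(steps) of the Gauss-Markov model."""
    gamma0 = np.linalg.cholesky(P0)                   # P0 = gamma0 gamma0^T, computed once
    z = z0 + gamma0 @ rng.standard_normal(len(z0))    # (4.163)
    path = [z]
    for _ in range(steps):
        # (4.162): B is already a square root of the noise covariance B B^T
        z = A @ z + B @ rng.standard_normal(B.shape[1])
        path.append(z)
    return np.array(path)
```

Only one matrix decomposition is needed for the whole path. If the model were instead given in the form (4.164), a factorization of $W(t)$ would be required inside the loop at every time step.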

The problem of dynamic posterior sampling is discussed in Section 11.1 in the chapter on sampling.



4.5 Dynamic Estimation for Discrete-State Systems

Most of this text is concerned with the estimation of continuous-state systems; however, there are contexts (particularly in Chapter 7) where the dynamic state is discrete. We begin by characterizing the discrete-state problem, followed by the Viterbi algorithm, recursive like the Kalman filter, but for discrete-state systems.



4.5.1 Markov Chains 



We have extensively looked at the first-order Gauss-Markov dynamic model of (4.1). The first-order discrete-state analogue of (4.1) is known as a Markov chain [42, 137]. Given the discrete state $z(t) \in \Psi_z$, where $\Psi_z = \{\psi_i\}$ is the set of possible state values, we have a discrete set of state-to-state transition probabilities

$$ a_{ij} = \Pr\big(z(t+1) = \psi_i \mid z(t) = \psi_j\big). \tag{4.165} $$



Thus the dynamic matrix A completely characterizes the transition statistics from t 
to t + 1, and indeed for all time if the process is time- stationary. 



It is possible to have higher-order processes, such as 

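$$ \Pr\big(z(t+1) = \psi_i \mid z(t) = \psi_j,\, z(t-1) = \psi_k\big) \tag{4.166} $$

or nonscalar processes, such as

$$ \Pr\left( \begin{bmatrix} z_1(t+1) = v \\ z_2(t+1) = w \end{bmatrix} \,\middle|\, \begin{bmatrix} z_1(t) = x \\ z_2(t) = y \end{bmatrix} \right). \tag{4.167} $$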



Both of these cases can, in principle, be converted into the first-order scalar form 
using a state augmentation process very similar to that in Section 4.1.1, so our focus 
is on the first-order scalar case. 



If we let $\pi(t)$ represent the state probability distribution at time $t$,

$$ \pi(t) = \begin{bmatrix} \vdots \\ \Pr(z(t) = \psi_i) \\ \vdots \end{bmatrix} \tag{4.168} $$

then the stochastic evolution of the process is elegantly written as

$$ \pi(t+1) = A\,\pi(t). \tag{4.169} $$



What is attractive about this formulation is that a wide variety of statistical proper- 
ties of the Markov chain, such as connectedness, periodic cycles, and steady-state 
behaviour, can be inferred from an analysis of A. 
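
As a small illustration of (4.169), the sketch below propagates a state distribution and approximates the steady-state behaviour by repeated application of A. The two-state transition probabilities in the usage note are invented for illustration and are not those of any figure in the text; A is stored so that `A[i][j]` is the probability of moving to state i from state j, meaning each column sums to one.

```python
def propagate(A, pi):
    """One step of pi(t+1) = A pi(t), with A[i][j] = Pr(next = i | current = j)."""
    return [sum(A[i][j] * pi[j] for j in range(len(pi))) for i in range(len(A))]


def steady_state(A, pi, iters=200):
    """Approximate the stationary distribution by iterating the chain."""
    for _ in range(iters):
        pi = propagate(A, pi)
    return pi
```

For `A = [[0.9, 0.5], [0.1, 0.5]]`, starting from `[1.0, 0.0]`, the distribution converges to (5/6, 1/6), the solution of $\pi = A\pi$. A periodic or disconnected chain would not converge in this way, which is one of the properties that an analysis of A reveals.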



[Figure 4.4 panels: "Dynamic (Prior) Model" (left) and "Measurement (Forward) Model" (right), with Measurement #1: Mood (Upset, Happy) and Measurement #2: Head Covering (Umbrella, Nothing, Hat).]

Fig. 4.4. A discrete-state first-order dynamic model. The dynamic model, left, is described in terms of state-transition probabilities. The measurement model, right, describes the state-dependent probabilities of two observations, mood and head covering. Clearly the model is simplistic: people probably would be happy if it were to rain after a few weeks of sun; however, such a model requires memory and is not first order.



Analogously to the continuous-state observation equation, here the discrete state $z(t)$ of the Markov chain influences some observation $m(t)$, such that the forward model is expressed as a conditional likelihood

$$ c(m, \psi) = p\big(m(t) \mid z(t) = \psi\big) \tag{4.170} $$

if the measurements are continuous, or as

$$ c_{ij} = \Pr\big(m(t) = m_i \in \Psi_m \mid z(t) = \psi_j\big) \tag{4.171} $$

if the measurement at each time is a discrete scalar.

An example of a discrete-state dynamic model and a corresponding measurement model are shown in Figure 4.4.



4.5.2 The Viterbi Algorithm 



We are given a scalar Markov chain, characterized by $A$, together with observations $m(t)$, $1 \le t \le \tau$. A state sequence

$$ \mathbf{z} = \begin{bmatrix} z(1) \\ \vdots \\ z(\tau) \end{bmatrix} \in \Psi^\tau \tag{4.172} $$

is composed of states over time, such that there are $|\Psi^\tau| = |\Psi|^\tau$ possible sequences. We wish to estimate such a state sequence, in particular the MAP estimate

$$ \hat{\mathbf{z}}_{MAP} = \arg\max_{\mathbf{z} \in \Psi^\tau} \Pr\big(\mathbf{z} \mid m(1), \ldots, m(\tau)\big). \tag{4.173} $$

Although the total number $|\Psi|^\tau$ of possible state sequences is impossibly large, because $z$ is first-order Markov it is possible to use methods of induction or dynamic programming to find a recursive solution, the Viterbi algorithm [320], to (4.173).

Let 

$$ \pi_i(t|s) = \Pr\big(z(t) = \psi_i \mid m(1), \ldots, m(s)\big) \tag{4.174} $$

and suppose that $\pi(1|0)$ is given, analogous to the Kalman filter initialization $z_0, P_0$ at time $t = 0$. Then the prediction step is given by the Markov chain statistics (4.169):

$$ \pi(t|t-1) = A\,\pi(t-1|t-1) \tag{4.175} $$

and the update step follows from Bayes' rule:

$$ \pi_i(t|t) = \pi_i(t|t-1) \cdot \frac{\Pr\big(m(t) \mid z(t) = \psi_i\big)}{\Pr\big(m(t)\big)}. \tag{4.176} $$

The resulting vector $\pi(t|t)$ is the probability of the possible values of $z(t)$, summing over all possible state sequences before time $t$.

Suppose we are given a state sequence and its associated probability:

$$ \mathbf{z}(t-1) = \begin{bmatrix} z(1) \\ \vdots \\ z(t-1) \end{bmatrix} \qquad \Pr\big(\mathbf{z}(t-1) \mid t-1\big). \tag{4.177} $$

Then the predicted probability of the next state transition is

$$ \Pr\big([\mathbf{z}(t-1), z(t)] \mid t-1\big) = a_{z(t),z(t-1)} \Pr\big(\mathbf{z}(t-1) \mid t-1\big) \tag{4.178} $$

therefore, as in (4.176), the updated probability is

$$ \Pr\big([\mathbf{z}(t-1), z(t)] \mid t\big) = a_{z(t),z(t-1)} \Pr\big(\mathbf{z}(t-1) \mid t-1\big)\, \frac{\Pr\big(m(t) \mid z(t)\big)}{\Pr\big(m(t)\big)}, \tag{4.179} $$

where the calculation of $\Pr(m(t))$ may be difficult, and has no effect on the relative state-sequence probabilities, so that the normalization factor in the denominator can be ignored.

Not all state sequences are very likely, however. It is the MAP estimate $\hat{\mathbf{z}}_{MAP}$ for which we would like to derive a forward-recursion. Of all possible state sequences which end up in state $\psi_i$ at time $t-1$, let $\hat{\mathbf{z}}^i_{MAP}(t-1|t-1)$ be the most likely one. Then, using (4.179), the MAP estimate at the next time step is just the most likely of all possible sequences:

$$ \hat{\mathbf{z}}^j_{MAP}(t|t) = \big[\hat{\mathbf{z}}^i_{MAP}(t-1|t-1),\, \psi_j\big] \quad \text{where } i = \arg\max_k \Pr\big([\hat{\mathbf{z}}^k_{MAP}(t-1|t-1),\, \psi_j] \mid t\big). \tag{4.180} $$

At the final time step, the MAP estimate is given by that sequence with the highest probability:

$$ \hat{\mathbf{z}}_{MAP}(\tau|\tau) = \max_i \big\{\hat{\mathbf{z}}^i_{MAP}(\tau|\tau)\big\}. \tag{4.181} $$

Figure 4.5 provides a visualization of the Viterbi estimation process for the model of Figure 4.4. Although the three optimal paths $\hat{\mathbf{z}}^i_{MAP}(t|t)$ happen to be coincident, this is not a requirement; it is perfectly possible to have the $|\Psi|$ state sequences with different state histories.
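
The forward recursion of (4.180), (4.181) can be sketched as follows. The sketch makes several assumptions: states are indexed 0, ..., |Ψ|−1, `A[j][i]` is the probability of moving to state j from state i, `like[t][i]` holds the measurement likelihood of state i at step t, and the normalization by Pr(m(t)) is dropped, as the text permits. The function name is hypothetical.

```python
def viterbi(A, pi1, like):
    """Return the MAP state sequence and its unnormalized probability."""
    n = len(pi1)
    prob = [pi1[i] * like[0][i] for i in range(n)]   # best-sequence probability per end state
    paths = [[i] for i in range(n)]                  # one preserved sequence per end state
    for t in range(1, len(like)):
        new_prob, new_paths = [], []
        for j in range(n):
            # (4.180): best predecessor among the preserved sequences
            i = max(range(n), key=lambda k: A[j][k] * prob[k])
            new_prob.append(A[j][i] * prob[i] * like[t][j])
            new_paths.append(paths[i] + [j])
        prob, paths = new_prob, new_paths
    best = max(range(n), key=lambda i: prob[i])      # (4.181)
    return paths[best], prob[best]
```

Exactly |Ψ| sequences are kept at every step, one per end state. For long sequences the products underflow, so a practical implementation would rescale `prob` at each step or work with log-probabilities; that refinement is omitted here.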



4.5.3 Comparison to Kalman Filter 

The Viterbi and Kalman filters both solve a discrete-time estimation problem and are both based on a prediction-update framework; however, there are significant differences.

First, the Kalman filter is optimum in a least-squares sense, whereas the Viterbi method finds the globally optimum MAP solution. For a discrete-state problem, the estimate is either correct or incorrect, so a least-squares notion of "closeness" is inappropriate.

Second, it is important to realize that the Viterbi method is essentially smoothing, estimating the optimum state sequence given all of the measurements, in contrast to the Kalman filter. The distinction arises because the discrete-state system allows the enumeration of all $|\Psi|$ possible state values at some point in time, and the first-order nature of $z(t)$ implies that it is possible to find a recursive estimator, dependent only on the $|\Psi|$ optimal solutions up to time $t$. The number of solutions to preserve increases rapidly with model order, since an $n$th-order estimator is conditioned on the value of $n$ successive states, requiring $|\Psi|^n$ state sequences.

Finally, the Viterbi method maintains $|\Psi|$ different state sequences, qualitatively similar to the multiple-hypothesis approaches of the ensemble, unscented, and particle Kalman filters of Section 4.3.4. However, for the discrete-state case, the $|\Psi|$ state sequences enumerate all possible optimal paths for a first-order system, whereas the multiple Kalman-like filters in Section 4.3.4 cannot possibly enumerate all paths, but increase the probability of a good estimate by improving robustness and better characterizing the estimator statistics in the presence of nonlinearities.

Fig. 4.5. The Viterbi algorithm, applied to the model of Figure 4.4. The model is initialized, followed by measurements over four time steps. Light lines show hypothetical state transitions, whereas the bold lines and probabilities correspond to the three state sequences preserved at each time step in $\{\hat{\mathbf{z}}^i_{MAP}(t|t)\}$.

Fig. 4.6. Satellite observations of the ocean surface temperature for the month of October 1992: the measurements are accurate, in the sense of low noise; however, they are spatially located in narrow, sparse strips.

Fig. 4.7. Prediction: (a) The initial stationary correlation length of the prior model, which is updated with the four point measurements (b). After twenty prediction steps, (c) the error standard deviations and (d) the estimation error correlation length $\xi(20|19)$ are both clearly nonstationary.

Application 4: Temporal Interpolation of Ocean Temperature [191]



Consider the data shown in Figure 4.6: the ATSR satellite offers accurate observa- 
tions of the ocean surface, however the orbital pattern of the satellite, the narrow 
view, and the interference by clouds end up giving us measurements in sparse strips, 
certainly nothing like a map. 

What we want to do is to produce estimates over time, so that as measurements 
arrive we slowly fill in a dense map. Of course, the ocean surface is dynamic, so we 
would like something akin to a Kalman filter that takes the dynamics into account 
and produces dense estimates. A direct implementation of the Kalman filter is not possible, because the 512 × 512 size of the image is far too large, as it leads to a state vector with 262,144 elements, and correspondingly huge covariance matrices.

In principle the prediction step can be implemented along the lines of the extended Kalman filter: have some sort of dynamic model which includes ocean-surface currents etc., and use these dynamics to predict the estimates.

It is the update step which causes us problems. If the prior model $P(t|t-1)$ at some time step $t$ is spatially stationary, then there are some tricks that we can use to implement the update step efficiently; here the multiscale method of Section 10.3 was used to solve the update step. However, as illustrated in Figure 4.7, if we have sparse measurements then the posterior $P(t|t)$ becomes nonstationary, thus giving us a nonstationary prior at time $t+1$, for which effective approximate modelling is much more difficult.

To solve this problem [191], the nonstationary spatial prior was approximated as a weighted mix of stationary models, as described in Section 5.6.2. The results, shown in Figure 4.8, show the estimates filling in the dense map over time, as more and more measurements arrive.



Summary 

Given a dynamic Bayesian estimation problem 

$$ z(t+1) = Az(t) + Bw(t) \qquad w(t) \sim \mathcal{N}(0, I) \tag{4.182} $$

$$ m(t) = Cz(t) + v(t) \qquad v(t) \sim \mathcal{N}(0, R) \tag{4.183} $$

the optimum least-squares recursive estimator, the Kalman filter, can be formulated as

$$ \text{Prediction:} \quad \hat{z}(t|t-1) = A\hat{z}(t-1|t-1) \tag{4.184} $$

$$ P(t|t-1) = AP(t-1|t-1)A^T + BB^T \tag{4.185} $$

$$ \text{Update:} \quad \hat{z}(t|t) = \hat{z}(t|t-1) + K(t)\,\big(m(t) - C\hat{z}(t|t-1)\big) \tag{4.186} $$

$$ P(t|t) = P(t|t-1) - K(t)\,C\,P(t|t-1) \tag{4.187} $$

$$ K(t) = P(t|t-1)\,C^T\,\big(CP(t|t-1)C^T + R\big)^{-1}. \tag{4.188} $$

If matrices A, B, C, R are not a function of time, then we have the steady-state Kalman filter, in which only (4.184) and (4.186) are iterated.

The smoothed estimates $\hat{z}(t|\tau), P(t|\tau)$ can be computed recursively using the Kalman smoother.

Fig. 4.8. Anomaly estimates and error standard deviations for the proposed estimator, applied to five months of ATSR sea-surface temperature data, starting in June, 1992.
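
The summary recursion (4.184)-(4.188) translates almost line for line into code. The following is a sketch under assumptions, not a reference implementation: it uses NumPy, takes time-invariant matrices, and forms the gain with an explicit inverse for readability, where a linear solve would usually be preferred numerically.

```python
import numpy as np


def kalman_step(z_est, P_est, m, A, B, C, R):
    """One prediction/update cycle of the Kalman filter."""
    z_pred = A @ z_est                                        # (4.184)
    P_pred = A @ P_est @ A.T + B @ B.T                        # (4.185)
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)    # (4.188)
    z_upd = z_pred + K @ (m - C @ z_pred)                     # (4.186)
    P_upd = P_pred - K @ C @ P_pred                           # (4.187)
    return z_upd, P_upd, K
```

Neither `P_upd` nor `K` depends on the measurement values, so for time-invariant matrices they can be iterated ahead of time until they stop changing; the converged gain is what the steady-state filter then uses in (4.184) and (4.186).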



For Further Study 



A great many books have been written on Kalman filtering, of which the textbooks 
by Anderson and Moore [8], Grewal and Andrews [151], Hayes [156], and Mendel 
[231] are a representative sample. 

There is a Kalman filter toolbox available for MATLAB, and many open-source im- 
plementations can be found. 



Sample Problems 



Problem 4.1: No Measurements 

In the regular Kalman filter, if at some time t we have no measurements, then 
presumably the update step should have no effect. Prove that this is so. That is, if 
there are no measurements, show that the regular Kalman update step reduces to 

$$ \hat{z}(t|t) = \hat{z}(t|t-1) \qquad P(t|t) = P(t|t-1), \tag{4.189} $$

meaning that we can skip the update step entirely at those iterations in which we 
have no measurements. 

Problem 4.2: Update-Update Form 

In the regular Kalman filter, substitute the prediction equations into the update equations in order to derive the update-update form of (4.82), (4.83).

Problem 4.3: Noncausal Filtering

It is straightforward to implement a noncausal Kalman filter by augmenting the state, as in (4.143), such that $z(t)$ is estimated based on measurements up to time $t + r$, where $r$ is the degree of state augmentation in (4.143), (4.144).

The reason for doing such estimation is, of course, because it is likely that future measurements are relevant to estimating the present state. That is, the estimation error covariance $P(t|t+r)$ should decrease with increasing $r$.

(a) Prove that $P(t|t+r)$ cannot increase with $r$. That is, prove that

$$ P(t|t+r+1) \le P(t|t+r) \quad \text{for all } r \ge 0. \tag{4.190} $$

(b) Under what conditions of the dynamic model $A(t), B(t)$ is (4.190) met with equality, such that the error covariance does not decrease with $r$?

(c) Under what conditions of the measurement model $C(t), R(t)$ is (4.190) met with equality, such that the error covariance does not decrease with $r$?

Problem 4.4: Filter Implementation 

Suppose we have the dynamic system

$$ z(0) \sim \mathcal{N}(0, 4) \qquad z(t) = a\,z(t-1) + w(t) \qquad w(t) \sim \mathcal{N}(0, 1) \tag{4.191} $$

and measurements

$$ m(t) = z(t) + v(t) \qquad v(t) \sim \mathcal{N}(0, 4). \tag{4.192} $$

We can use the dynamics (4.191) to synthesize a process $z(t)$ and then create measurements via (4.192), which can then be used by the Kalman filter to generate estimates $\hat{z}(t|t)$. Iterate over twenty time steps $0 < t \le 20$:

(a) Set $a = 0.9$. Plot $z(t)$, $m(t)$, $\hat{z}(t|t)$, $\sqrt{P(t|t)}$.

(b) Set $a = 2.0$. Plot $z(t)$, $m(t)$, $\hat{z}(t|t)$, $\sqrt{P(t|t)}$.

Note that for $a = 2$ the dynamic system is unstable; however, does the Kalman filter still do a good job in estimating $z(t)$?

Our dynamic system is time-invariant, therefore the steady-state Kalman filter can be used. We assume that the Kalman filter is essentially in steady-state after twenty iterations, that is, that

$$ P(20|20) \approx P(\infty|\infty). \tag{4.193} $$

(c) Plot $P(20|20)$ as a function of $a$ for $-3 \le a \le 3$.

There is a great deal to observe and explain in this plot. For each of the following, give a short mathematical (equation-based) argument, but also a descriptive, insightful explanation.

(d) Why is the plot symmetric about $a = 0$?

(e) Explain the value of $P(20|20)$ at $a = 0$.

(f) Why do we see asymptotic behaviour in $P$ for large $|a|$?

(g) Explain the value of $P(20|20)$ at $a = \infty$.



Part II 



Modelling of Random Fields 



Multidimensional Modelling 



In principle, the extension to multidimensional z of the concepts of regularization 
from Chapter 2 and the static and dynamic estimators of Chapters 3 and 4 is perfectly 
straightforward. The only inherent limitation in the developed estimators is that the 
unknowns z and measurements m are required to be column vectors. Attempting to 
substitute matrices (e.g., a measured image) for m will yield meaningless results. 

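It is easy, however, to reorder the image $Z$ on a two-dimensional grid $\Omega$ into a vector via lexicographic reordering:

$$ Z = \begin{bmatrix} z_{11} & z_{12} & z_{13} \\ z_{21} & z_{22} & z_{23} \\ z_{31} & z_{32} & z_{33} \end{bmatrix}_{3 \times 3} \quad \underset{\text{Unlexicograph.}}{\overset{\text{Lexicograph.}}{\rightleftarrows}} \quad z = Z_: = \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{33} \end{bmatrix} \tag{5.1} $$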



There is flexibility, of course, in the specifics of how the image Z should be scanned 
into the vector; that is, whether by columns, as in (5.1), by rows, or in some other 
fashion. Unless specified otherwise, by default we will choose to stack by columns. 
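
The column-stacking convention can be made concrete with a short sketch. The image is held here as a plain list of rows, and the function names are hypothetical; the entries in the usage note are labelled by row and column index so the ordering is easy to read.

```python
def lexicograph(Z):
    """Stack the columns of image Z (a list of rows) into one vector."""
    rows, cols = len(Z), len(Z[0])
    return [Z[i][j] for j in range(cols) for i in range(rows)]


def unlexicograph(z, rows, cols):
    """Invert the column stacking, rebuilding the rows-by-cols image."""
    return [[z[j * rows + i] for j in range(cols)] for i in range(rows)]
```

For `Z = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]`, `lexicograph(Z)` gives `[11, 21, 31, 12, 22, 32, 13, 23, 33]`, in agreement with (5.1), and `unlexicograph` applied to that vector with `rows=3, cols=3` recovers `Z`. Pixels that are horizontal neighbours in the image end up `rows` positions apart in the vector, which is the source of the more complicated matrix structures mentioned below.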

Clearly the above approach extends trivially to higher dimensions, first stacking a 
three-dimensional cube by planes, and next the stacked plane by columns, however 
a certain degree of nervousness regarding the size of the resulting vectors is appro- 
priate. 

All of the previously-developed ideas of conditioning, regularization etc. apply, ex- 
cept that the matrices involved (e.g., covariance P or constraints L) possess more 
complicated structures due to the rearranged nature of the lexicographical stacking. 
Nevertheless, working with two- and higher-dimensional spaces introduces difficul- 
ties which we did not consider in previous chapters. This chapter begins with a dis- 
cussion of the challenges posed by higher-dimensional problems, followed by a dis- 
cussion of multidimensional deterministic and statistical models. 

5.1 Challenges



Ultimately, all of the challenges related to multidimensional data processing reduce
to a problem of size. In particular, for an $N \times N$ image $Z$, the associated vector
$z = Z_:$ is of length $N^2$, thus the associated covariance matrix $P_z$ is of size
$N^2 \times N^2$. Even for a very modestly sized image of $256 \times 256$ pixels, this
corresponds to a covariance containing $2^{16} \cdot 2^{16} = 2^{32}$, or roughly four
billion, elements.
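
The arithmetic can be checked in a few lines. This is only a back-of-envelope sketch: the helper name `dense_covariance_size` is hypothetical, and the figure of 8 bytes per element assumes double-precision storage, which the text does not specify.

```python
def dense_covariance_size(n_rows, n_cols, bytes_per_element=8):
    """Return (state length, covariance elements, bytes) for a dense covariance."""
    n_state = n_rows * n_cols          # length of the stacked vector z
    n_elements = n_state ** 2          # entries in the dense covariance
    return n_state, n_elements, n_elements * bytes_per_element

n_state, n_elements, n_bytes = dense_covariance_size(256, 256)
print(n_state, n_elements, n_bytes / 2**30)  # state length, elements, GiB
```

Under these assumptions the dense covariance of a 256 x 256 image would occupy 32 GiB, before any computation is attempted.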

Given a large multidimensional problem, there are three operations which we need 
to be able to undertake, forming the primary challenges we need to consider: 

Storage: We need to store matrices (on disk or in memory). 

Computation: We need to compute matrix inverses, matrix-matrix products, and 
matrix-vector products. 

Modelling: We need to infer / specify certain model matrices (e.g., A, C, L, P, R). 

To the extent that the above three steps are practical, corresponding to problems 
having up to roughly ten-thousand elements (that is, a 100 x 100 pixel image), the 
methods of Chapters 3 and 4 are immediately applicable. For much larger prob- 
lems, such as the global sixth-degree 2160 x 1080 pixel oceanographic problem of 
Application 2, obviously much more care must be taken. In particular, the direct ap- 
plication of Chapter 3 to such a problem is computationally inconceivable, thus the 
problem either needs to be reformulated or made smaller, consequently leading to 
the following four challenges: 

1. Dimensionality reduction: How to reduce the size of the problem, or to decom- 
pose it into multiple smaller pieces. 

2. Efficient matrix storage: Methods for sparse matrix representation, to reduce stor- 
age requirements. 

3. Efficient computation: Associated methods for efficient computation with sparse 
matrices. 

4. Statistical modelling: How to model transformed problems, whether of reduced 
size, decomposed, sparse, or some other simplified or efficient representation. 

The remainder of this chapter surveys methods to address these challenges, with 
specific approaches developed in greater detail in the following chapters. 
[Figure 5.1 panel labels: Decoupled; Sequential or Causal Coupling; Tree or Acyclic
Coupling; Cyclic Coupling; Multilayer Coupled]

Fig. 5.1. The way in which state elements interact or are coupled strongly influences the
difficulty of the problem. Five examples are shown here, in increasing order of difficulty.



5.2 Coupling and Dimensionality Reduction 



The presence or absence of coupling between state variables plays a major role in the 
algorithmic complexity — the difficulty in formulating a solution — and associated 
computational complexity and numerical conditioning. 

Figure 5.1 shows five examples, in increasing order of difficulty: 

• In a decoupled problem the state elements do not interact, therefore there is no 
coupling to encode (reduced representation complexity) and each element can be 
modelled and processed independently. 

• In a sequential problem, the elements do interact, however the interaction is 
causal, meaning that the elements can be ordered from start to end. Some re- 
cursive approach, such as a variant of a Kalman filter, may be applied to such 
problems. 

• In trees or acyclic graphs (relationships having no loops) the behaviour is com- 
plicated by the less regular structure, and by the fact that multiple elements (the 
leaves of the tree) can influence one state (the root). Nevertheless the absence of 
loops allows generalizations of the Kalman filter to be applied to such problems, 
usually at very low computational complexity. 

• In cyclically coupled (loopy) problems, the bidirectionality of influence makes 
the problem much harder than the decoupled and causal cases. Many algorithmic 
solutions to such problems are iterative, updating first one state, then the other, 
then back to the first and so on. 

• Coupled multilayer problems are like coupled ones, but their heterogeneous form
makes it more difficult to transform them into something simpler, whereas it may
be possible to convert a regularly-structured coupled problem.

With the exception of problems with unusually convenient structures (see, in particular,
Section 8.3), in approaching a multidimensional problem of any significant size
our goal is frequently one of simplifying the coupling attributes, specifically breaking
loops via some sort of decomposition or transformation, so that a looped graph
becomes acyclic, whether tree-based, sequential, or decoupled.

Fig. 5.2. A wide variety of approaches exists for addressing large multidimensional
problems. Sketched above is a quick survey of seven of the broadest approaches to 3D (top)
and 2D (bottom) problems: direct, change of basis, reduction of basis, local processing,
hierarchical processing, marching, and decoupling. Grey regions show changed or transformed
coefficients; solid lines within dashed regions show a reduction in the number of coefficients.

Although the details are left for other chapters to develop further, Figure 5.2 gives a 
quick overview of the most promising categories: 



(a) Direct (Chapter 3): The problem is set up in dense form, and solved directly 
per the usual equations. This approach is infeasible for all but the most trivially 
sized problems. 

(b) Change of Basis (Section 8.1): Although the size of the problem is not reduced, 
by reducing the correlation among state elements the overall conditioning of the 
problem is improved. 

(c) Reduction of Basis (Section 8.2): An extension of change of basis, in many con- 
texts the redundancy among the state elements implies that the original problem 
can be very accurately represented using a reduced state, essentially no different 
from the principle of image compression. 

(d) Local Methods (Section 8.2.3): A special case of reduction of basis, here each 
estimate is based only on a local pruned version of the original problem. Except 
for those problems with localized correlations, this approach is normally rather 
poor. 
[Figure 5.3; recoverable panel label: Figure 5.2(c,e,g), Basis Reduction + Decoupling +
Hierarchy]

Fig. 5.3. Three examples of hybrid approaches to three-dimensional modelling, based on
combining two or more methods from Figure 5.2: (b,f) applying a global change of basis and
processing plane by plane using a marching approach; (e,f) applying a hierarchical approach
to each of the planes in a marching approach; (c,e,g) decomposing the problem into
independent planes, reducing the basis by eliminating most of the insignificant planes, and
applying a hierarchical approach to each plane independently.



(e) Hierarchical or Scale Recursion (Figure 5.5): A specialized change of basis, 
such that the elements can be organized into a hierarchy of scales. 

(f) Marching (Section 10.1): Applying a Kalman filter to one axis of a multidimen- 
sional problem, solving the problem iteratively over columns or planes. 

(g) Decoupling (Section 8.2.1): Apply an optimal change of basis, decoupling the
problem into independent pieces. Although the optimal change of basis is too
difficult to apply to a whole $d$-dimensional problem, if the problem is spatially
stationary then the optimal one-dimensional decoupling may be found along
some dimension, leaving an ensemble of simpler $(d-1)$-dimensional problems.

Clearly more than one approach may be used in a given context; three illustrations 
of such hybrid schemes are shown in Figure 5.3. 

The topic of hierarchical methods deserves some additional comment. There are, 
typically, two classes of motivations for a hierarchical method: 



Pragmatics: Hierarchical methods, particularly hierarchical changes of basis (such
as wavelets) are often found to be very effective at improving problem conditioning,
and are thus chosen for their numerical properties.

Philosophy: For most large-scale multidimensional problems, there is some recognition
[110, 148, 273] that the fine-scale details in one part of the problem are
unnecessary in modelling some distant part of the problem, as illustrated in
Figure 5.4. Instead, it may be preferable to aggregate these details (and then to
repeatedly aggregate the aggregations, leading to a hierarchy).

Fig. 5.4. (a) Knowing the details of individual, distant points is probably not necessary in
estimating the value of a random field at point *. Instead, (b) a hierarchical approach is
motivated whereby the details are aggregated into a much simpler form [148, 273].

Fig. 5.5. Three basic approaches to hierarchical modelling. Domain decomposition partitions
the domain into small pieces, multiresolution methods represent a domain at coarse scales
using relatively few coefficients, and nested aggregation groups small regions to produce fewer
but larger ones.

Very generally, three classes of hierarchical methods exist, illustrated in Figure 5.5, 
all of which will be developed in further detail as appropriate: 

1. Domain decomposition: Also known as "divide-and-conquer," a large problem 
is repeatedly divided into some number of smaller decoupled subproblems. At 
each scale, only the boundaries of the decoupled subproblems are of interest, the 
interior of each subproblem having been solved at a finer scale. Nested dissec- 
tion (Section 9.1.3) and statistical multiscale (Section 10.3) are examples of this 
approach. 

2. Multiresolution: The problem is represented at a variety of resolutions, such that 
the coarse scales have only few state values representing low-resolution aspects 
of the random field, with progressively finer details at finer scales. Although su- 
perficially similar to domain decomposition, there are two significant differences: 

(a) A multiresolution approach normally does not decouple into multiple prob- 
lems at finer scales. There is one problem to solve at each scale. 

(b) Methods of domain decomposition normally do not represent a random field
at varying resolutions. Although one talks about "coarse" and "fine" scales,
the representation at the coarsest scale is normally a subset of the finest
scale, not a low-resolution version of the finest scale.

Hierarchical basis change (Sections 8.4.1 and 8.4.2) and the multigrid algorithm 
(Section 9.2.5) are examples of this approach. 

3. Nested aggregation: In both the decomposition and multiresolution approaches
the geometry of the coarsest scale (that is, the extent of the coarse scale elements
on the finest scale) is predetermined. However, if a random field consists
of irregularly-shaped patches, it may be inappropriate to represent this random
field using large, square pixels at a coarse scale. In region aggregation, the coarser
scales are constructed by the repeated iterative aggregation of finer scales, leading
to arbitrarily-shaped regions whose shapes are driven by the finest-scale
observations.



5.3 Sparse Storage and Computation 
5.3.1 Sparse Matrices 

In the event that the covariance $P$ for an $N \times N$ image $Z$ is completely nonstationary
without any structure or regularity, there is not really any alternative to specifying the
complete, dense $N^2 \times N^2$ matrix $P$. On the other hand, there is something that feels
very wrong about needing to specify and store $N^4/2$ parameters (see footnote 1) for an
image of $N^2$ pixels.

Indeed, the solution in all cases is to assert, assume, or derive a model which leads
to sparse matrices. In some cases this sparsity is obvious, such as with diagonal
matrices. For example, very frequently it is reasonable to assume that measurements
have uncorrelated errors, in which case $R$ is diagonal, and so only the vector

$$r^T = [\, r_{11} \; \ldots \; r_{nn} \,] = \mathrm{diag}(R) \tag{5.2}$$

needs to be stored. As a limiting case, if the measurement noise is uncorrelated and
also stationary, then $R = \sigma^2 I$ and the entire matrix is represented by the single scalar
variance $\sigma^2$.

Although there is a large literature of sparse matrix techniques [141], many of which
apply to arbitrary patterns of matrix sparsity, in general the regular pixellated nature
of images and volumes leads to fairly regular structures. Figure 5.6 shows two such
examples for $10 \times 10$ pixel images $Z$. Suppose that any pixel $z_{ij}$ is correlated only
with some subset of its immediate neighbours; then the corresponding covariance $P$
(see footnote 2) is a diagonally banded matrix, where the bands are the consequence of
the column-wise lexicographic reordering of $Z$. In these examples, the thickness of the
main diagonal band (3 and 5, respectively) is determined by the number of
vertically-related elements, whereas the number of band sets (also 3 and 5) is determined
by the number of horizontally-related elements.

Thus in Figure 5.6(a), rather than using some complicated sparse representation for 
the covariance, it is much simpler and more efficient to recognize the banded struc- 
ture of P, representing the b diagonal bands of P as a dense b x n matrix. 
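
The banded structure can be made concrete with a short sketch. It assumes NumPy, a column-wise ordering, and a simple pattern in which each pixel interacts with itself and its four nearest neighbours; this pattern and the helper names `neighbour_pattern` and `band_offsets` are illustrative assumptions, not necessarily the exact patterns drawn in Figure 5.6.

```python
import numpy as np

def neighbour_pattern(n_rows, n_cols):
    """Boolean sparsity pattern: each pixel interacts with itself and its
    four nearest neighbours (an assumed, illustrative pattern)."""
    n = n_rows * n_cols
    pattern = np.zeros((n, n), dtype=bool)
    for j in range(n_cols):
        for i in range(n_rows):
            k = i + j * n_rows                      # column-wise index of pixel (i, j)
            for di, dj in [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]:
                ii, jj = i + di, j + dj
                if 0 <= ii < n_rows and 0 <= jj < n_cols:
                    pattern[k, ii + jj * n_rows] = True
    return pattern

def band_offsets(pattern):
    """Sorted list of diagonal offsets (column minus row) that hold a nonzero."""
    rows, cols = np.nonzero(pattern)
    return sorted(set((cols - rows).tolist()))

pattern = neighbour_pattern(10, 10)
offsets = band_offsets(pattern)
print(offsets)                                   # diagonals that are occupied
print(len(offsets) * pattern.shape[0], pattern.size)  # banded vs dense storage
```

Vertical neighbours land on the diagonals at offsets of plus or minus one, while horizontal neighbours land a full image column away, which is why the bands come in separated sets.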

Although sparse banded matrices sound very straightforward, the challenge to sparse 
modelling stems from the following: 

• The product of sparse matrices is normally more dense than the original, and 

• The inverse of a sparse matrix is normally dense. 

That is, the sparse structure of a given model tends to disappear very quickly. Much 
of the essence of the methods described in this book is the preservation, whether 
implicitly or explicitly, of model sparseness. 
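
Both effects are easy to observe numerically. The sketch below assumes NumPy and uses a small, diagonally dominant tridiagonal matrix chosen purely for illustration; the helper `count_nonzero` and the tolerance are likewise assumptions of the example.

```python
import numpy as np

n = 8
# Tridiagonal (3-banded) test matrix; the values are arbitrary.
T = 2.0 * np.eye(n) - 0.5 * np.eye(n, k=1) - 0.5 * np.eye(n, k=-1)

def count_nonzero(M, tol=1e-12):
    """Number of entries whose magnitude exceeds a small tolerance."""
    return int(np.sum(np.abs(M) > tol))

nnz_T = count_nonzero(T)                     # three bands
nnz_product = count_nonzero(T @ T)           # five bands: denser than T
nnz_inverse = count_nonzero(np.linalg.inv(T))  # every entry is nonzero

print(nnz_T, nnz_product, nnz_inverse)
```

The entries of the inverse do decay away from the diagonal, which is what makes banded approximations to an inverse plausible, but none of them is exactly zero.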



1 The symmetry of a covariance matrix means that there are actually $q(q+1)/2$ parameters
to specify for a $q \times q$ matrix.

2 This example is for illustration purposes only. Most banded matrices are not positive- 
definite, and would therefore not be a valid covariance. In practice, it is covariance inverses 
which are banded, not the covariances themselves. 
Fig. 5.6. Two illustrations of sparsity, both corresponding to a $10 \times 10$ two-dimensional
domain. The lower arrays show a lexicographic wrapping of one row of the top matrices, showing
the (a) five or (b) nine elements whose interactions are being modelled. The explanation of the
patterned thick middle bands is complex, to be discussed in Section 5.5, and is due to the
choice of boundary conditions at the image boundary.



5.3.2 Matrix Kernels 



In seeking to model large multidimensional domains, an extremely important special 
case is that of spatial stationarity, where the spatial statistics or the form of a dynamic 
model do not vary from one state element to the next. In such a case, defining the 
nature of the matrix at some location implicitly defines the matrix at all other loca- 
tions; such an implicit representation is referred to as a matrix kernel, and is far more 
sparse than any banded representation. 

Consider a two-dimensional random field $Z = \{z_{ij}\}$, where the lexicographically
reordered field has a given covariance

$$z = [Z]_: \sim P. \tag{5.3}$$



If $Z$ is spatially stationary then, ignoring the boundaries of the domain for now, the
statistics are shift invariant:

$$E[z_{a,b}\, z_{i,j}] = E[z_{a+x,b+y}\, z_{i+x,j+y}] = E[z_{0,0}\, z_{i-a,j-b}] \tag{5.4}$$



from which it is clear that all of the field correlations can be inferred from the
correlation of one element $z_{0,0}$ with the rest of the random field $Z$. That is, the
covariance kernel $\mathcal{P}$ is

$$\mathcal{P} \triangleq E[z_{0,0}\, Z]. \tag{5.5}$$

If $z_{0,0}$ is the first element of $[Z]_:$, and $p_1$ the first column of covariance $P$, then

$$p_1 = E\bigl[z_{0,0}\, [Z]_:\bigr] \tag{5.6}$$

and the kernel may be succinctly represented as

$$[\mathcal{P}]_: = p_1 \tag{5.7}$$

and we see that the kernel is nothing more than the reordering of the first column 
of the covariance. In the unusual case that the correlations are zero beyond some 
spatial offset q, then the kernel can be shrunk to represent only the local, nonzero 
correlations: 



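$$
\mathcal{P} = E\left\{ z_{0,0} \cdot
\begin{bmatrix}
z_{-q,-q} & \cdots & z_{0,-q} & \cdots & z_{q,-q} \\
\vdots & & \vdots & & \vdots \\
z_{-q,0} & \cdots & z_{0,0} & \cdots & z_{q,0} \\
\vdots & & \vdots & & \vdots \\
z_{-q,q} & \cdots & z_{0,q} & \cdots & z_{q,q}
\end{bmatrix}
\right\}
\tag{5.8}
$$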



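The kernel $\mathcal{P}$ contains everything needed to reconstruct any entry of covariance $P$:

$$
P_{(a,b),(i,j)} =
\begin{cases}
0 & \text{if } |a-i| > q \text{ or } |b-j| > q \\
\mathcal{P}_{i-a,\,j-b} & \text{if } |a-i| \le q \text{ and } |b-j| \le q.
\end{cases}
\tag{5.9}
$$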
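
The lookup rule in (5.9) can be sketched as follows. The example assumes NumPy, ignores boundary effects (as the text does at this point), and stores the kernel so that array position `[x + q, y + q]` holds the correlation at offset (x, y). The kernel values and the function names `cov_entry` and `build_covariance` are hypothetical, chosen only to make the indexing visible.

```python
import numpy as np

def cov_entry(kernel, q, a, b, i, j):
    """Covariance between z[a, b] and z[i, j], looked up from the kernel."""
    dx, dy = i - a, j - b
    if abs(dx) > q or abs(dy) > q:
        return 0.0                       # beyond the kernel support
    return kernel[dx + q, dy + q]        # kernel[x + q, y + q] stores offset (x, y)

def build_covariance(kernel, q, N):
    """Dense covariance of an N x N field, first index varying fastest."""
    P = np.zeros((N * N, N * N))
    for b in range(N):
        for a in range(N):
            row = a + b * N
            for j in range(N):
                for i in range(N):
                    P[row, i + j * N] = cov_entry(kernel, q, a, b, i, j)
    return P

q, N = 1, 4
# Illustrative kernel: 0.3 at offsets (+-1, 0), 0.1 at offsets (0, +-1).
kernel = np.array([[0.0, 0.3, 0.0],
                   [0.1, 1.0, 0.1],
                   [0.0, 0.3, 0.0]])
P = build_covariance(kernel, q, N)
```

In practice one would rarely build `P` at all; the point of the kernel is that the nine stored numbers stand in for all 256 entries of this small example.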



A second important class of matrix kernels corresponds to linear dynamic matrices. 
Suppose we consider dynamics of the form 



$$z(t+1) = A\,z(t). \tag{5.10}$$



If the dynamics are spatially stationary (footnote 3), then how $A$ acts to determine
$z_{a,b}(t+1)$ is just a space-shifted version of how $A$ acts to determine
$z_{a+1,b-2}(t+1)$. That is, we can define a set of weights $a_{i,j}$ such that

$$z_{a,b}(t+1) = \sum_i \sum_j a_{i,j}\, z_{a+i,b+j}(t). \tag{5.11}$$

We recognize this as a convolution; collecting the weights $a_{i,j}$ into a kernel
$\mathcal{A}$, we have a succinct version of (5.11) as

$$A\,z(t) = [\mathcal{A} * Z(t)]_: . \tag{5.12}$$
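
The equivalence between the dense dynamics and the kernel form can be checked on a small field. This sketch assumes NumPy and, to sidestep the boundary question, wraps indices periodically; that periodic wrap is an assumption of the example, as are the weight values and the names `apply_kernel` and `dense_dynamics`.

```python
import numpy as np

def apply_kernel(weights, Z):
    """Evaluate sum_ij a_ij * Z[a+i, b+j] at every (a, b), with periodic wrap."""
    out = np.zeros(Z.shape, dtype=float)
    for (i, j), a_ij in weights.items():
        # np.roll(Z, (-i, -j), axis=(0, 1))[a, b] equals Z[(a+i) % N, (b+j) % N]
        out += a_ij * np.roll(Z, shift=(-i, -j), axis=(0, 1))
    return out

def dense_dynamics(weights, N):
    """The same operator written as a dense N^2 x N^2 matrix (column-wise order)."""
    A = np.zeros((N * N, N * N))
    for b in range(N):
        for a in range(N):
            row = a + b * N
            for (i, j), a_ij in weights.items():
                col = (a + i) % N + ((b + j) % N) * N
                A[row, col] += a_ij
    return A

# Hypothetical weights a_ij, keyed by spatial offset (i, j).
weights = {(0, 0): 0.6, (1, 0): 0.1, (-1, 0): 0.1, (0, 1): 0.15, (1, -2): 0.05}
N = 5
rng = np.random.default_rng(0)
Z = rng.standard_normal((N, N))

A = dense_dynamics(weights, N)
via_matrix = A @ Z.flatten(order="F")
via_kernel = apply_kernel(weights, Z).flatten(order="F")
print(np.allclose(via_matrix, via_kernel))
```

Five weights describe the whole operator here, whereas the dense matrix has 625 entries, all but 125 of them zero.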



3 As written, the dynamics in (5.10) are also temporally stationary, however this is not
required. It is perfectly reasonable to have a spatially stationary time-varying kernel
$\mathcal{A}(t)$.

The compactness and elegance of this kernel approach make it particularly useful
in the development of deterministic (Section 5.5) and stochastic (Section 5.6) models,
particularly because, in contrast to the dynamics matrix $A$ or covariance $P$, the
kernels $\mathcal{A}$, $\mathcal{P}$ have the same numbers of dimensions as $Z$ and allow a
fairly intuitive understanding of the implied dynamics or statistics. Ideally, we work
entirely in the kernel domain, without lexicographical reordering at all, such as

$$Z(t+1) = \mathcal{A} * Z(t) \qquad \mathcal{P} = E[z_{0,0}\, Z]. \tag{5.13}$$

The shorthand $\mathcal{A} * Z$ should not quite be taken literally; indeed, the convolution
(Appendix C.1) $\mathcal{A} * Z$ produces a result on a larger domain than that of $Z$.
Instead, the convolution is understood to apply over most of the domain, with certain
exceptions at the boundaries.

To elaborate further on this point, it is important to realize that matrix kernels can 
be applicable in problems which are spatially nonstationary (as shown, for exam- 
ple, with boundary conditions in Section 5.5.1). Although arbitrary nonstationarities 
would require preserving the entire dynamics A or covariance P, if most of the field 
is stationary, characterized by a single kernel, with only rare exceptions (such as 
at the edges of the domain, or at region boundaries within the domain), then it is 
possible to characterize the nonstationary behaviour by a relatively small number of 
stationary kernels. 

Strictly speaking, matrix kernels do not need to be rectangular in shape and only 
need to store nonzero elements. When we sketch a kernel it is often convenient to 
show only these nonzero elements, however in implementation it is usually most 
convenient, computationally, to store the kernel as a small, dense, rectangular array. 



5.3.3 Computation 

The dominant component of computational complexity invariably revolves around 
matrix multiplication, matrix inversion, and (rarely) matrix eigendecomposition or 
positive-definiteness testing. We summarize below the computational issues associ- 
ated with dense, banded, and kernel matrix representations. 

Dense Matrices 

Given dense matrices $A \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{n \times k}$,
$C \in \mathbb{R}^{k \times p}$, $D \in \mathbb{R}^{k \times k}$, then the
following complexities are well known:

• Computing $A + B$ has complexity $O(nk)$,

• Computing $A \odot B$ has complexity $O(nk)$,

• Computing $A \cdot C$ has complexity $O(nkp)$,

• Computing $D^{-1}$ has complexity $O(k^3)$.

It should be mentioned, in passing, that faster algorithms are known for both
multiplication and inversion. The oft-cited Strassen's method [294] computes a matrix
inverse in $O(k^{\log_2 7}) \approx O(k^{2.807})$, and faster methods, to approximately
$O(k^{2.4})$, have been developed, all based on fast methods for matrix multiplication [247].
In general, however, the implementation complexity of these methods limits their use.

The computation of a matrix determinant is a more subtle question than matrix addition
or multiplication. In particular, because the matrix determinant may be used
to address questions of matrix singularity or positive-definiteness, it is important to
establish whether a determinant is zero, positive, or negative. Therefore numerical
rounding errors in determinant computation may be unacceptable, so there exists
literature [1] on exact determinant computation for matrices having integer entries,
with a complexity of approximately $O(n^4 \ln n)$. In practice, the MATLAB evaluation
of a matrix determinant for a nonsingular $n \times n$ matrix is approximately $O(n^3)$.

Given an $n \times n$ matrix, testing its positive-definiteness involves the application of
Sylvester's test (Appendix A.3), requiring the computation of determinants of $1 \times 1$,
$2 \times 2$, ..., $n \times n$ matrices, for a total complexity of $O(n^4)$.

Banded Matrices 

If matrices are sparse, then the computational issues are somewhat different. For a
banded, sparse $n \times n$ matrix $A$, let

$$
B_A(i) =
\begin{cases}
0 & \text{if } a_{j+i,j} = 0 \text{ for all } j \\
1 & \text{if, for at least one } j,\ a_{j+i,j} \neq 0
\end{cases}
\tag{5.14}
$$

That is, $B_A(i)$ indicates whether diagonal band $i$ is present in matrix $A$.
Then bands from either matrix contribute to the sum in matrix addition:

$$B_{A+B} = \mathrm{sign}(B_A + B_B), \tag{5.15}$$

and matrix multiplication corresponds to a convolution in band space:

$$B_{AB} = \mathrm{sign}(B_A * B_B), \tag{5.16}$$

a result that is easily proved:

$$\text{Given } C = A \cdot B \;\Longrightarrow\; c_{ik} = \sum_j a_{ij}\, b_{jk} \tag{5.17}$$

from which it follows that

$$B_C(i-k) = 1 \quad \text{if} \quad \sum_j B_A(i-j)\, B_B(j-k) > 0 \tag{5.18}$$

$$B_C(i) = \mathrm{sign}\Bigl\{ \sum_j B_A(i-j)\, B_B(j) \Bigr\} \tag{5.19}$$

$$\phantom{B_C(i)} = \mathrm{sign}\bigl\{ B_A(i) * B_B(i) \bigr\}. \tag{5.20}$$

Fig. 5.7. The convolutional nature of sparse banded-matrix multiplication. Each matrix
represents a lexicographically ordered two-dimensional problem; thick bands indicate
interactions between vertically-adjacent elements, spaced bands indicate interactions between
horizontally-adjacent elements. The circled elements indicate the kernel origin.

Four illustrations of this convolutional behaviour are shown in Figure 5.7. As each 
band in a banded matrix is associated with a single entry in its associated kernel (with 
the main diagonal corresponding to the kernel origin), that a convolution appears here 
is directly related to the multiplication of matrix kernels, discussed below. 
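
The band-space convolution can be verified numerically. The sketch below assumes NumPy and nonnegative matrix entries, so that no band of the product vanishes through cancellation; the test matrices and the helper `band_indicator` are illustrative choices.

```python
import numpy as np

def band_indicator(M):
    """B_M(i) for i = -(n-1), ..., n-1: is any entry m[j+i, j] nonzero?
    Band i lies i places below the main diagonal, i.e. np.diag(M, k=-i)."""
    n = M.shape[0]
    return np.array([int(np.any(np.diag(M, k=-i) != 0))
                     for i in range(-(n - 1), n)])

n = 9   # e.g. a 3 x 3 image, stacked by columns
A = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)   # bands -1, 0, 1
B = np.eye(n) + np.eye(n, k=3) + np.eye(n, k=-3)   # bands -3, 0, 3

B_A, B_B = band_indicator(A), band_indicator(B)

# Full convolution of the two indicator sequences; keep the central 2n-1
# entries, which correspond to bands -(n-1), ..., n-1 of an n x n matrix.
full = np.convolve(B_A, B_B)
predicted = np.sign(full[n - 1 : 3 * n - 2])

actual = band_indicator(A @ B)
present = [i for i, flag in zip(range(-(n - 1), n), actual) if flag]
print(present)   # band indices occupied in the product
```

With signed entries the prediction becomes an upper bound on the bands present, since individual products may cancel.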

Banded matrix inversion is somewhat more problematic. Whereas the product of
two banded matrices yields another, slightly denser, banded matrix, the inverse of a
banded matrix is typically full. It is possible, however, to formulate sparse, banded
approximations to the matrix inverse [278]. A simple approach is to write a matrix
$P = D + N$ in terms of its diagonal band $D$ and nondiagonal bands $N$. If $P$ is
diagonally dominant, meaning that $D^{-1}N$ is stable, then there is a series approximation
to $P^{-1}$,

$$P^{-1} = D^{-1} - D^{-1}ND^{-1} + D^{-1}ND^{-1}ND^{-1} - \ldots, \tag{5.21}$$

where the number of series terms depends on the degree of diagonal dominance. That
this series is valid is easily seen by computing the product

$$
\begin{aligned}
P \cdot P^{-1} &= (D + N) \cdot \bigl( D^{-1} - D^{-1}ND^{-1} + D^{-1}ND^{-1}ND^{-1} - \ldots \bigr) \\
&= I - ND^{-1} + ND^{-1}ND^{-1} - \ldots \\
&\quad\;\; + ND^{-1} - ND^{-1}ND^{-1} + \ldots \\
&= I.
\end{aligned}
\tag{5.22}
$$

If $N$ is $b$-banded, then the series builds up the dense matrix inverse by summing
smaller and smaller terms having more and more bands:

$$
P^{-1} \approx
\underbrace{D^{-1}}_{1\ \text{band}}
- \underbrace{D^{-1}ND^{-1}}_{b^2\ \text{bands}}
+ \underbrace{D^{-1}ND^{-1}ND^{-1}}_{b^4\ \text{bands}}
- \ldots
\tag{5.23}
$$
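
A truncated version of this series is straightforward to try out. The following sketch assumes NumPy and a small, strongly diagonally dominant test matrix; the function name `series_inverse` and the choice of test matrix are assumptions made for illustration, and the code works with dense arrays rather than a banded storage scheme.

```python
import numpy as np

def series_inverse(P, n_terms):
    """Sum the first n_terms of D^-1 - D^-1 N D^-1 + D^-1 N D^-1 N D^-1 - ..."""
    d = np.diag(P)
    D_inv = np.diag(1.0 / d)
    N = P - np.diag(d)                 # the nondiagonal bands
    term = D_inv.copy()
    approx = np.zeros(P.shape, dtype=float)
    for _ in range(n_terms):
        approx += term
        term = -D_inv @ N @ term       # each new term is -D^-1 N times the last
    return approx

n = 6
P = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # diagonally dominant
exact = np.linalg.inv(P)

err_3 = np.max(np.abs(series_inverse(P, 3) - exact))
err_30 = np.max(np.abs(series_inverse(P, 30) - exact))
print(err_3, err_30)
```

Keeping only the first few terms yields a banded approximation to the inverse; how many terms are needed depends on how strongly the diagonal dominates.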



Matrix Kernels 

Basic matrix operations can be undertaken directly in the kernel domain (Sec- 
tion 5.3.2). As much as possible, all spatially stationary parts of a spatial statistical 
problem should be kernel-represented. 

If we take the sum of two matrices then

$$(A + B)x = Ax + Bx = [\mathcal{A} * X]_: + [\mathcal{B} * X]_: = [(\mathcal{A} + \mathcal{B}) * X]_: \tag{5.24}$$

thus the kernel of $(A + B)$ is $(\mathcal{A} + \mathcal{B})$.
The kernel corresponding to matrix multiplication is derived as

$$(A \cdot B)x = A \cdot [\mathcal{B} * X]_: = \bigl[\mathcal{A} * [[\mathcal{B} * X]_:]_{n \times n}\bigr]_: = [\mathcal{A} * \mathcal{B} * X]_: \tag{5.25}$$

thus the kernel of $(A \cdot B)$ is $(\mathcal{A} * \mathcal{B})$.

Because a scalar is unchanged when transposed, we can derive the kernel for $L^T$ as
follows:

$$
\begin{aligned}
x^T L y &= (x^T L y)^T = y^T L^T x \\
x^T [\mathcal{L} * Y]_: &= y^T [\mathcal{L}^T * X]_: \\
\sum_i x(i) \sum_j y(j)\, \mathcal{L}(i-j) &= \sum_j y(j) \sum_i x(i)\, \mathcal{L}^T(j-i)
\end{aligned}
\tag{5.26}
$$

from which it follows that a matrix transpose corresponds to a mirror image in the
kernel domain:

$$\mathcal{L}(i) = \mathcal{L}^T(-i). \tag{5.27}$$

If a matrix is circulant (stationary, with periodic boundary conditions) then the kernel
corresponding to the inverse matrix can be found using the FFT, as described in
Section 8.3:

$$\mathcal{A}^{-1} = \mathrm{FFT}_n^{-1}\bigl(1 \oslash \mathrm{FFT}_n(\mathcal{A})\bigr). \tag{5.28}$$

Since the FFT diagonalizes $A$, from the eigenvalues $\Lambda = \mathrm{FFT}_n(\mathcal{A})$ the
positive-definiteness of the associated matrix $A$ can be tested as

$$A > 0 \quad \text{iff} \quad \lambda_{ij} > 0 \;\; \forall\, i, j. \tag{5.29}$$
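
The FFT-based inverse and positive-definiteness test can be sketched for a two-dimensional periodic problem. The example assumes NumPy, a symmetric kernel (so that the eigenvalues are real), and a storage convention in which the kernel origin sits at array position `[0, 0]` with negative offsets wrapped to the far end. The five-point kernel and all function names are hypothetical.

```python
import numpy as np

def kernel_eigenvalues(kernel):
    """Eigenvalues of the circulant operator defined by a symmetric kernel."""
    return np.fft.fft2(kernel).real

def is_positive_definite(kernel):
    return bool(np.all(kernel_eigenvalues(kernel) > 0))

def inverse_kernel(kernel):
    """Kernel of the inverse operator: elementwise reciprocal in the FFT domain."""
    return np.fft.ifft2(1.0 / np.fft.fft2(kernel)).real

def apply_circulant(kernel, Z):
    """Apply the circulant operator to a field Z (circular convolution)."""
    return np.fft.ifft2(np.fft.fft2(kernel) * np.fft.fft2(Z)).real

def five_point_kernel(N, centre):
    """Centre weight at the origin, -1 at the four nearest neighbours."""
    k = np.zeros((N, N))
    k[0, 0] = centre
    k[1, 0] = k[-1, 0] = k[0, 1] = k[0, -1] = -1.0
    return k

good = five_point_kernel(8, 4.2)    # smallest eigenvalue 4.2 - 4 = 0.2
bad = five_point_kernel(8, 3.9)     # smallest eigenvalue 3.9 - 4 = -0.1

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 8))
recovered = apply_circulant(inverse_kernel(good), apply_circulant(good, Z))
print(is_positive_definite(good), is_positive_definite(bad))
```

The inverse kernel is generally dense over the whole domain even when the original kernel has only five nonzero entries, mirroring the earlier remark that the inverse of a sparse matrix is normally dense.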

Matrix Inversion

We need to develop a skepticism any time large matrix inverses are called for. It is 
rare that an explicit inverse is needed and in most cases the problem can be reformu- 
lated. 

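However, given a prior $P$, if the inverse $P^{-1}$ is needed then direct inversion may
be unavoidable. Most attractive is state decoupling (Chapter 8), leading to a
block-diagonal matrix, in which case the matrix inverse $P^{-1}$ is found by inverting each of
the diagonal blocks individually:

$$
P = \begin{bmatrix} P_1 & & \\ & \ddots & \\ & & P_k \end{bmatrix}
\quad \Longrightarrow \quad
P^{-1} = \begin{bmatrix} P_1^{-1} & & \\ & \ddots & \\ & & P_k^{-1} \end{bmatrix}
$$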

Other solutions include state reduction (Chapter 8), series approximation (5.21),
approximating the field as stationary and using kernel methods, or based on some
sort of restricting assumption, such as a banded inverse [10] or a block-banded
inverse [11]. The latter methods are assumption-dependent and have a sufficiently
complex implementation to put them outside the scope of this text.

If the matrix inverse arises as part of computing estimates, such as in

$$\hat{z}(m) = \bigl(L^T L + C^T R^{-1} C\bigr)^{-1} C^T R^{-1} m, \tag{5.30}$$

then the inversion of $(L^T L + C^T R^{-1} C)$ is avoidable. Instead, we should rewrite
the problem as a linear system

$$\bigl(L^T L + C^T R^{-1} C\bigr)\, \hat{z}(m) = C^T R^{-1} m \tag{5.31}$$

and solve this using Gaussian elimination, a Cholesky decomposition, or iterative 
methods, all of which are discussed in Chapter 9. The advantages of these latter 
approaches are in terms of computational complexity, storage complexity, and nu- 
merical robustness. 
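
The difference between forming the inverse and solving the linear system can be sketched on a small one-dimensional problem. Everything specific here is an assumption of the example: NumPy as the tool, a first-difference constraint matrix `L`, measurements of every fifth state element, and a noise variance of 0.1. `np.linalg.solve` uses an LU factorization internally, standing in for the elimination methods mentioned above.

```python
import numpy as np

n = 50
# First-difference constraints: row i computes z[i+1] - z[i].
L = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

# Observe every fifth element of z, with uncorrelated noise of variance 0.1.
observed = np.arange(0, n, 5)
C = np.eye(n)[observed]
R_inv = np.eye(len(observed)) / 0.1

rng = np.random.default_rng(1)
m = rng.standard_normal(len(observed))     # synthetic measurements

M = L.T @ L + C.T @ R_inv @ C              # left-hand side of the linear system
rhs = C.T @ R_inv @ m

z_solve = np.linalg.solve(M, rhs)          # solve the system directly
z_inverse = np.linalg.inv(M) @ rhs         # explicit inverse, for comparison only
print(np.max(np.abs(z_solve - z_inverse)))
```

On a problem this small both routes agree to rounding error; the case for avoiding the explicit inverse rests on cost, storage and robustness as the problem grows, since the inverse is dense even when `M` is banded.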

Complexity Comparison 

Table 5.1 summarizes the computational complexity for the dense, banded, and kernel
matrix types just discussed. The challenge in solving large multidimensional
problems is not to find faster ways to invert huge covariance matrices, rather how
to design or discover effective models which lead to attractive properties, such as
sparsity, bandedness, block-diagonal matrices, diagonal dominance, or kernels.

Operation                      Matrix Type
                               Dense       Banded      Kernel

Matrix Storage                 O(n^2)      O(bn)       O(q)
Matrix-Matrix Sum              O(n^2)      O(bn)       O(q)
Matrix-Matrix Multiply         O(n^3)      O(b^2 n)    O(q)
Matrix-Vector Multiply         O(n^2)      O(bn)       O(qn)
Matrix Inversion               O(n^3)      O(n^3)      O(n log n)
Positive-Definiteness Test     O(n^4)      O(n^4)      O(n log n)

Table 5.1. A comparison of computational complexity for dense, banded, and kernel matrix
operations. In all cases the underlying problem z has n elements, banded matrices have b
bands, and kernels have q nonzero elements.



5.4 Modelling 

The previous section discussed mostly pragmatic matters — how to store and mathe- 
matically manipulate large matrices. However, this discussion is largely abstract and 
artificial, playing games with matrices, unless it can be connected, in a meaningful 
fashion, to the solution of large inverse problems of interest. 

The challenge of modelling is not just one of specifying relationships, such as the $N^6$
interrelationships among the elements of an $N \times N \times N$ cube; it is fairly
straightforward, after all, to invent or guess functions which assert a statistical
relationship, such as an autocorrelation for a stationary model:

$$f(i,j,k) = E[z(0,0,0)\, z(i,j,k)]. \tag{5.32}$$

But not all choices of $f$ make sense, or are even valid. Rather, problem modelling
simultaneously needs to satisfy four criteria:

1. The model must be legal or valid, meaning that derived or implied covariances
$P$, $R$ or covariance inverses $P^{-1}$ must be positive-definite.

2. The model should not be overfitting training data. 

3. Ideally, the structure of the model should be compatible with an efficient mathe- 
matical solution, aspects of which were discussed in Section 5.3. 

4. The model must be faithful to the problem. To the extent that the model is an 
approximation, the errors which appear should be negligible or tolerable. 

The above criteria are individually straightforward, but challenging as a whole: the 
most obvious models tend to be invalid, the efficient and elegant models tend to be 
inaccurate, and the accurate models tend to be computationally infeasible. 

By far the most serious problem in model specification and development is that of
positive-definiteness:

• If a prior covariance P fails to be positive-definite (that is, it has a negative
eigenvalue), even to only a tiny degree, the estimation results can become completely
meaningless, and may not necessarily bear any resemblance to an approximate solution.

• Given a hypothesized prior covariance P, verifying that it is positive-definite is
extremely difficult, requiring the application of Sylvester's test (Appendix A.3)
or computing all of the eigenvalues of P.

In other words, it is crucial that P be positive-definite, and for large problems it is
essentially impossible to check the positive-definiteness of P!

There do exist methods [48], known as covariance extension, which can form a 
positive-definite covariance given a few correlations. Unfortunately this problem re- 
mains unsolved in higher than one dimension, and is computationally inconceivable, 
in any event, for problems of significant size. 

Additional challenges to the modelling process are the extra complications added
by the lexicographic reordering of the pixel elements. Furthermore the choice of a
suitable model can depend subtly on the problem context, thus the actual process by
which a model is selected is highly creative and therefore difficult to describe in a
structured, step-by-step fashion.

We focus on the construction of models which are known to be positive-definite, so
that eigendecompositions are not required for validation, and then choose among
these available models to suit a given problem context.

The next two sections examine "deterministic" and "statistical" models, respectively.
The former, following on Section 3.1, seeks to specify a set of constraints $L$ or $L^T L$,
as required by (3.32). The latter, following on Section 3.2, seeks to specify prior
models, possibly $P$ in the case of (3.65) or $P^{-1}$ for (3.67).

From Section 3.2.4 we know, of course, that there are both conceptual and algebraic
overlaps between these two classes of models; that is, to some degree we can
associate $L^T L$ with $P^{-1}$. In particular, as will become clear later, there are
relationships between the local deterministic models of Section 5.5 and the Markov
statistical models of Section 5.6.4.



5.5 Deterministic Models 



By deterministic modelling, we mean the assertion of a set of constraints which are
compatible with the types of behaviour that are expected in $z$, but we do not necessarily
believe that these constraints represent a complete description of the statistics
of $z$. That is, the constraints serve to regularize and condition the problem, not so
much to model it.

We assert a set of constraints {l { z}, equivalently penalizing 



z T L T Lz, 



where L 



l T 

Lkq. 



(5.33) 



Thus $L$ is a rectangular matrix, where the number of columns equals $n$, the dimensionality
of $z$, and where the number of rows $q$ equals the number of constraints.
Indeed, the ordering of constraints (rows) in $L$ is completely arbitrary, which is both
intuitive and easily verified by seeing that $L^T L$ is unaffected by row-reordering in $L$.

The key to this approach is its guaranteed positive-semidefiniteness. Recalling the
definition of a positive-semidefinite matrix $Q$:

$$ x^T Q x \ge 0 \quad \forall x, \tag{5.34} $$

thus

$$ x^T L^T L x = (Lx)^T (Lx) = u^T u \ge 0 \quad \forall x, \qquad u = Lx. \tag{5.35} $$

That is, any real matrix product $L^T L$ is, by definition, positive-semidefinite.
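As a quick numerical check of (5.34) and (5.35), the following sketch builds a small constraint matrix, forms $Q = L^T L$, and tests both the sign of the quadratic form and the claim that row order is irrelevant. The helper names `gram` and `quad_form`, the sizes, and the random choice of $L$ are assumptions made purely for illustration.

```python
import random

def gram(L):
    """Return Q = L^T L for a matrix L stored as a list of rows."""
    n = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(n)]
            for i in range(n)]

def quad_form(Q, x):
    """Evaluate the quadratic form x^T Q x."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))

random.seed(0)
q, n = 4, 6  # four sparse constraints on a six-element state (arbitrary sizes)
L = [[random.choice([-1, 0, 0, 1]) for _ in range(n)] for _ in range(q)]
Q = gram(L)

# x^T Q x equals the squared length of Lx, so it should never be negative
# (up to floating-point rounding), whatever x we try.
smallest = min(quad_form(Q, [random.uniform(-1, 1) for _ in range(n)])
               for _ in range(1000))

# Reordering the constraints (rows of L) leaves Q = L^T L unchanged.
L_shuffled = L[:]
random.shuffle(L_shuffled)
order_irrelevant = gram(L_shuffled) == Q

print(smallest >= -1e-12, order_irrelevant)
```

Because the entries of $L$ are integers here, the comparison of the two Gram matrices is exact; with real-valued constraints a tolerance would be needed.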

The most common choices for L involve local constraints; that is, each constraint is 
a function of only a few spatially proximate elements of z. Although locality is intu- 
itively appealing it is not of great significance computationally; indeed, any locality 
in two- and higher-dimensional problems is quickly destroyed by the lexicographic 
reordering. Rather, the key is that each constraint be sparse, involving only a small
number of elements of $z$, whether spatially local or not.

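The starting point for $L$ is normally the discretization of simple variational constraints,
which were introduced in Section 2.4.1:

Membrane:

$$ \|Lz\| = \iint \left( \left|\frac{\partial z}{\partial x}\right|^2 + \left|\frac{\partial z}{\partial y}\right|^2 \right) dx\,dy \tag{5.36} $$

$$ \phantom{\|Lz\|} \approx \|L_x z\| + \|L_y z\| \tag{5.37} $$

Thin Plate:

$$ \|Lz\| = \iint \left( \left|\frac{\partial^2 z}{\partial x^2}\right|^2 + 2\left|\frac{\partial^2 z}{\partial x\,\partial y}\right|^2 + \left|\frac{\partial^2 z}{\partial y^2}\right|^2 \right) dx\,dy \tag{5.38} $$

$$ \phantom{\|Lz\|} \approx \|L_{xx} z\| + 2\,\|L_{xy} z\| + \|L_{yy} z\|, \tag{5.39} $$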



where $L_x, L_y, L_{xx}, L_{xy}, L_{yy}$ represent first- and second-order difference constraints
approximating the desired partial derivatives. For example, returning to the $3 \times 3$
example of (5.1), there are a total of six first-order horizontal and six first-order
vertical constraints:



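$$ L_x = \begin{bmatrix}
-1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 1
\end{bmatrix}
\qquad
L_y = \begin{bmatrix}
-1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{bmatrix} \tag{5.40} $$

The band spacing in $L_x$ and the staggered bands in $L_y$ are due to the lexicographic
structure of the one-dimensional vector $z$. The constraints are much more easily
interpreted if the rows of $L$ are lexicographically unwound: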



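$$ (l_x)_1^T = \begin{bmatrix} -1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \;\Longrightarrow\; \left[ (l_x)_1^T \right]_{3 \times 3} = \begin{bmatrix} -1 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \tag{5.41} $$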



a difference constraint on two horizontally adjacent elements, and 



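$$ (l_y)_2^T = \begin{bmatrix} 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \;\Longrightarrow\; \left[ (l_y)_2^T \right]_{3 \times 3} = \begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \tag{5.42} $$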



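a difference constraint on two vertically adjacent ones. Note that if a given constraint
set, such as $L_x$, is stationary, then it is much simpler and more intuitive to
think in terms of its generator or kernel:

$$ L_x z = \left[ \mathcal{L}_x * Z \right]_:, \tag{5.43} $$

a two-dimensional convolution, in which case $L_x, L_y, L_{xx}, L_{yy}, L_{xy}$ are represented
more compactly and elegantly via their corresponding kernels:

$$ \mathcal{L}_x = \begin{bmatrix} -1 & 1 \end{bmatrix} \qquad \mathcal{L}_y = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \tag{5.44} $$

$$ \mathcal{L}_{xx} = \begin{bmatrix} -1 & 2 & -1 \end{bmatrix} \qquad \mathcal{L}_{yy} = \begin{bmatrix} -1 \\ 2 \\ -1 \end{bmatrix} \qquad \mathcal{L}_{xy} = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \tag{5.45} $$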



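Rather than expressing the terms $L_x, L_y, \ldots$ separately, we can consider the complete
set of constraints

$$ L = \begin{bmatrix} L_x \\ L_y \end{bmatrix} \quad \text{or} \quad L = \begin{bmatrix} L_{xx} \\ L_{yy} \\ \sqrt{2}\, L_{xy} \end{bmatrix} \tag{5.46} $$

for the first- and second-order cases, respectively, although obviously other choices
of $L$ are possible. In solving the estimation problem (3.32) we need to represent
$Q = L^T L$. In some cases, particularly for stationary problems of modest size, it
may be more convenient to store $Q$ and not $L$ itself. Given a number of constraints
$L_1, L_2, \ldots$, we may find $Q$ as

$$ L = \begin{bmatrix} L_1 \\ L_2 \\ \vdots \end{bmatrix} \;\Longrightarrow\; Q = L^T L = \sum_i L_i^T L_i. \tag{5.47} $$

First-Order Membrane:

$$ \mathcal{Q}_{\text{Membrane}} = \mathcal{L}_x^T * \mathcal{L}_x + \mathcal{L}_y^T * \mathcal{L}_y = \begin{bmatrix} & -1 & \\ -1 & \boxed{4} & -1 \\ & -1 & \end{bmatrix} $$

Second-Order Thin-Plate:

$$ \mathcal{Q}_{\text{Plate}} = \mathcal{L}_{xx}^T * \mathcal{L}_{xx} + \mathcal{L}_{yy}^T * \mathcal{L}_{yy} + 2\,\mathcal{L}_{xy}^T * \mathcal{L}_{xy} = \begin{bmatrix} & & 1 & & \\ & 2 & -8 & 2 & \\ 1 & -8 & \boxed{20} & -8 & 1 \\ & 2 & -8 & 2 & \\ & & 1 & & \end{bmatrix} $$

First/Second-Order Mixed:

$$ \mathcal{Q}_{\text{Mixed}} = \mathcal{Q}_{\text{Membrane}} + \mathcal{Q}_{\text{Plate}} = \begin{bmatrix} & & 1 & & \\ & 2 & -9 & 2 & \\ 1 & -9 & \boxed{24} & -9 & 1 \\ & 2 & -9 & 2 & \\ & & 1 & & \end{bmatrix} $$

Fig. 5.8. Three typical constraint kernels $\mathcal{Q}$, derived from the given first- and second-order
pixel constraints; the boxed element marks the kernel origin. The manner of the derivation
is shown in (5.49).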



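Analogously to before, if the constraints are stationary then we can consider the more
intuitive convolutional kernel representation $\mathcal{Q}$ for $Q$, such that

$$ Qz = \left[ \mathcal{Q} * Z \right]_: = \left[ \left( \sum_i \mathcal{L}_i^T * \mathcal{L}_i \right) * Z \right]_:, \tag{5.48} $$

where $\mathcal{L}_i^T$ denotes the kernel $\mathcal{L}_i$ reflected about its origin. As a simple,
two-dimensional example,

$$ \mathcal{Q}_{\text{Membrane}} = \mathcal{L}_x^T * \mathcal{L}_x + \mathcal{L}_y^T * \mathcal{L}_y = \begin{bmatrix} 1 & -1 \end{bmatrix} * \begin{bmatrix} -1 & 1 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \end{bmatrix} * \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} & -1 & \\ -1 & \boxed{4} & -1 \\ & -1 & \end{bmatrix}, \tag{5.49} $$

with the boxed element marking the kernel origin.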

Two further kernels, computed in the same way, are shown in Figure 5.8. 
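The kernel arithmetic of (5.49) is easy to reproduce. The sketch below is a minimal pure-Python illustration, in which the helper names `conv2`, `flip`, `squared` and `add` are hypothetical and the 5 x 5 output size is chosen only so that all three kernels fit on a common grid. Each constraint kernel is convolved with its own reflection and the results are summed.

```python
def conv2(a, b):
    """Full two-dimensional convolution of two small kernels (lists of rows)."""
    out = [[0] * (len(a[0]) + len(b[0]) - 1) for _ in range(len(a) + len(b) - 1)]
    for i, row_a in enumerate(a):
        for j, va in enumerate(row_a):
            for k, row_b in enumerate(b):
                for l, vb in enumerate(row_b):
                    out[i + k][j + l] += va * vb
    return out

def flip(a):
    """Reflect a kernel about its origin."""
    return [row[::-1] for row in a[::-1]]

def squared(kernel, weight=1, size=5):
    """weight * (reflected kernel * kernel), zero-padded and centred on a size x size grid."""
    k = conv2(flip(kernel), kernel)
    r0, c0 = (size - len(k)) // 2, (size - len(k[0])) // 2
    out = [[0] * size for _ in range(size)]
    for i, row in enumerate(k):
        for j, v in enumerate(row):
            out[r0 + i][c0 + j] = weight * v
    return out

def add(*kernels):
    """Element-wise sum of equally sized kernels."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*kernels)]

Lx, Ly = [[-1, 1]], [[-1], [1]]
Lxx, Lyy = [[-1, 2, -1]], [[-1], [2], [-1]]
Lxy = [[-1, 1], [1, -1]]

Q_membrane = add(squared(Lx), squared(Ly))
Q_plate = add(squared(Lxx), squared(Lyy), squared(Lxy, weight=2))
Q_mixed = add(Q_membrane, Q_plate)

for row in Q_mixed:
    print(row)
```

The weight of 2 on the cross term corresponds to the factor $\sqrt{2}$ applied to $L_{xy}$ in (5.46).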

As discussed in Chapter 9 (see also Problem 5.5), it is important to realize that for
very large, nonstationary problems it may be preferable to store $L$ and to leave $Q$
implicit, such that the product $Qz$ is evaluated as

$$ Qz = L^T L z = \left( L^T (L z) \right). \tag{5.50} $$

The reason for preferring $L$ is that it is typically sparser than $Q$, and so more efficient
to represent. Furthermore, once $L$ has been specified, there is no compelling reason
to explicitly calculate the matrix-matrix product $Q = L^T L$. Finally, if in trying to
avoid the matrix-matrix product we try to specify $Q$ directly, there are challenges in
guaranteeing its positive-definiteness.

Fig. 5.9. Boundary conditions are implicitly specified by the presence or absence of constraints
for pixels at or near the domain boundary. Four typical examples are shown here of a 4 x
4 pixel domain with constraints shown as lines. Specific examples of kernels are shown in
Figures 5.10 and 5.11.
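To make the implicit product (5.50) concrete, here is a minimal sketch that stores only the sparse rows of $L = [L_x; L_y]$ and never forms $Q$. The function names, the `(index, weight)` row format, and the column-stacked pixel ordering (pixel $(i,j)$ maps to index $i + Kj$, matching the $3 \times 3$ example above) are assumptions for illustration; constraints that would leave the lattice are simply omitted.

```python
def first_order_constraints(K, N):
    """Sparse rows of L = [L_x; L_y] on a K x N lattice.

    Each row is a list of (index, weight) pairs; pixel (i, j) is stored at
    index i + K*j, i.e. the columns of the image are stacked.
    """
    rows = []
    for j in range(N):
        for i in range(K):
            p = i + K * j
            if j + 1 < N:
                rows.append([(p, -1.0), (p + K, 1.0)])  # horizontal difference
            if i + 1 < K:
                rows.append([(p, -1.0), (p + 1, 1.0)])  # vertical difference
    return rows

def apply_Q(rows, z):
    """Evaluate Qz = L^T (L z) one constraint at a time, without forming Q."""
    out = [0.0] * len(z)
    for row in rows:
        r = sum(w * z[k] for k, w in row)  # one element of Lz
        for k, w in row:
            out[k] += w * r                # its contribution to L^T (Lz)
    return out

K = N = 5
rows = first_order_constraints(K, N)
z = [0.0] * (K * N)
z[2 + K * 2] = 1.0                         # unit impulse at the centre pixel
Qz = apply_Q(rows, z)

# Unwinding Qz back into an image recovers the membrane kernel around the impulse.
for i in range(K):
    print([Qz[i + K * j] for j in range(N)])
```

Storage here grows with the number of nonzero constraint weights (two per row), whereas $Q$ would need roughly five entries per pixel for the same model, and many more for second-order constraints.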



5.5.1 Boundary Effects 

For any field of finite extent, we must specify the behaviour of the field at
its boundaries. The details of these boundary conditions will be highly problem-dependent,
but will normally fall into one of the categories shown in Figure 5.9:



Free Boundary: All constraints which would refer to pixels outside of the lat- 
tice are removed from L, therefore the boundary asserts no constraint (the state 
element is "free"). With these constraints removed the resulting kernel becomes 
nonstationary and varies in a predictable fashion, as illustrated in Figure 5.10. 

Zero Boundary: In contrast to the free boundary, above, if all constraints which 
would refer to pixels outside of the lattice are truncated, as illustrated in Fig- 
ure 5.11, then the truncated portions of the constraints implicitly set the state val- 
ues outside of the boundary to zero. 

Fig. 5.10. The free boundary kernel: the squared constraints ($\mathcal{Q}$) relating to pixels outside of
the domain have been removed, resulting in a nonstationary kernel. In all cases the circled
element denotes the kernel origin. Compare with the boundaries in Figure 5.11.

Fig. 5.11. Kernels at boundaries: if the kernel $\mathcal{Q}$ is truncated (left), then there is an implicit
reference to zero-valued pixels, imposing a boundary condition. If the kernel $\mathcal{Q}$ is wrapped
(right), then the surface is constrained to be periodic.

Specified Boundary: In some cases we may wish to assert boundary values other
than zero, for example to specify the gradient at the boundary, as illustrated in
Figure 5.12. Let $B$ be a specified set of boundary values around random field $Z$.
Then a given constraint $l$, which spills from the random field into the boundary, is
asserted as

$$ \left\| \, l^T \begin{bmatrix} z \\ b \end{bmatrix} \right\|. \tag{5.51} $$

By separating the portions of the constraint applied to $Z$ and $B$ we can rewrite the
penalty (5.51) as

$$ \left\| \begin{bmatrix} l_z^T & l_b^T \end{bmatrix} \begin{bmatrix} z \\ b \end{bmatrix} \right\| = \left\| l_z^T z + l_b^T b \right\| = \left\| l_z^T z + \beta \right\|. \tag{5.52} $$

Collecting all of the constraints into matrix form, we have the net penalty

$$ \left\| L_z z + \underline{\beta} \right\|, \tag{5.53} $$

so we see that the imposition of specified boundary conditions appears similarly
to that of a prior mean; the solution of (5.53) is discussed in parallel with the
discussion of prior means in Section 5.5.3.

[Figure 5.12 panels. Left: Global Gravitational Equipotential, Periodic East/West Boundaries
(Data Source NGA Office of GEOINT Sciences). Right: Fluid Flow at a Coastline, Zero Flow
Normal to Boundary.]

Fig. 5.12. Boundaries appear very frequently in remote sensing. The Geoid, left, has east
and west sides adjacent (periodic), whereas the top (north) and bottom (south) edges are not
periodic. Fluid flow near a coastline, right, is constrained to have a zero flow component at
right angles to the coast.

Periodic or Wrapped Boundary: State elements at opposing edges (top / bot- 
tom, left / right, etc.) are treated as adjacent, illustrated in Figure 5.11. Thus a 
reference to a pixel outside of the lattice is wrapped back inside, so for a K x N 
lattice, 

$$ z(i, j) = z(i \bmod K, \; j \bmod N). \tag{5.54} $$

Although such models are rarely perfectly realistic, they are popular because of 
the simplicity and computational efficiency of stationary models. Having left-right 
periodic boundaries is very common when estimating on a sphere, such as global 
remote sensing of the Earth, as shown in Figure 5.12. 

Clearly the above boundary types can be combined, in that part of the boundary may 
be free, another part may be zero, two sides may be periodic, etc. 
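The free, zero, and periodic cases differ only in how a constraint that reaches outside the lattice is treated, which the following one-dimensional sketch makes explicit. The function names and the string labels for the boundary types are assumptions for illustration; a first-order constraint is placed between every pair of neighbouring pixels, including the pairs that straddle the two ends.

```python
def first_order_L(n, boundary):
    """Rows of a one-dimensional first-order constraint matrix on n pixels.

    boundary is "free" (drop constraints that leave the lattice), "zero"
    (truncate them, so outside pixels are implicitly zero) or "periodic"
    (wrap outside references back into the lattice).
    """
    start = 0 if boundary == "periodic" else -1   # avoid counting the wrapped pair twice
    rows = []
    for i in range(start, n):
        terms = [(i, -1), (i + 1, 1)]             # constraint z[i+1] - z[i]
        leaves_lattice = any(not 0 <= k < n for k, _ in terms)
        if leaves_lattice and boundary == "free":
            continue
        row = [0] * n
        for k, w in terms:
            if 0 <= k < n:
                row[k] += w
            elif boundary == "periodic":
                row[k % n] += w
            # boundary == "zero": the outside term is simply truncated
        rows.append(row)
    return rows

def gram(L):
    """Q = L^T L for a matrix stored as a list of rows."""
    n = len(L[0])
    return [[sum(r[i] * r[j] for r in L) for j in range(n)] for i in range(n)]

Q = {b: gram(first_order_L(4, b)) for b in ("free", "zero", "periodic")}
for b, Qb in Q.items():
    print(b, [Qb[i][i] for i in range(4)], "corner:", Qb[0][3])
```

The free boundary lowers the diagonal of $Q$ at the two end pixels, giving a nonstationary kernel, while the periodic case produces the corner entries that couple the first and last pixels.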



5.5.2 Discontinuity Features 



With the exception of the boundaries we have, thus far, implicitly assumed stationary
constraints. Although the inference of a nonstationary model can be quite difficult,
and beyond the scope of this discussion, the assertion of known nonstationarities in
a model is straightforward.

Indeed, the constraints at each state element should reflect our understanding of 
smoothness (or not) at each location. As long as they can be encoded in terms of 
linear constraints, there are no restrictions on the types of nonstationarities to be as- 
serted. A few examples follow, in which a single one-dimensional constraint L is 
illustrated, although the concepts are equally applicable to 2D and 3D. In each case 
the stated property occurs at the point marked by an arrow: 



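First-order smoothness (no abrupt change in value) between the marked elements.
Constraint row: $[\, -1 \;\; 1 \,]$, marked between the two elements.

Second-order smoothness (no abrupt change in slope) at the marked state element.
Constraint row: $[\, -1 \;\; 2 \;\; -1 \,]$, marked at the central element.

State elements to the left and right are smooth, however a fold (abrupt change in
slope) may occur at the marked state element. That is, the value of the marked
element is constrained to be consistent with those of its neighbours, however there
are no constraints on its slope.
Constraint rows: $[\, -1 \;\; 1 \;\; 0 \,]$ and $[\, 0 \;\; -1 \;\; 1 \,]$, marked at the central element.

The two state elements to the left and to the right are constrained to be similar,
however between the marked elements a jump discontinuity (an abrupt change in
value) may occur, since there are no constraints relating the corresponding state
values.
Constraint rows: $[\, -1 \;\; 1 \;\; 0 \;\; 0 \,]$ and $[\, 0 \;\; 0 \;\; -1 \;\; 1 \,]$, marked between the
second and third elements.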



The use of the preceding elements is illustrated in Figure 5.13. 
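The effect of removing a constraint can be seen directly in the penalty $z^T L^T L z$. The sketch below is an illustration only: the names `first_order_chain`, `penalty` and `cut_after` are hypothetical, and only the jump-discontinuity case is shown. It compares a stationary first-order chain with one in which the constraint across a known discontinuity has been left out.

```python
def first_order_chain(n, cut_after=None):
    """First-order constraints on n elements, optionally omitting the
    constraint between elements cut_after and cut_after + 1."""
    rows = []
    for i in range(n - 1):
        if i == cut_after:
            continue                  # no constraint across the discontinuity
        row = [0] * n
        row[i], row[i + 1] = -1, 1
        rows.append(row)
    return rows

def penalty(rows, z):
    """The regularization penalty z^T L^T L z, i.e. the squared length of Lz."""
    return sum(sum(w * v for w, v in zip(row, z)) ** 2 for row in rows)

step = [0, 0, 0, 5, 5, 5]             # a signal with a jump between elements 2 and 3

stationary = penalty(first_order_chain(6), step)
with_cut = penalty(first_order_chain(6, cut_after=2), step)
print(stationary, with_cut)
```

With the cut in place the step incurs no penalty at all, so an estimator using these constraints is free to place a jump at that location while still smoothing on either side.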



5.5.3 Prior-Mean Constraints 



The standard membrane (first-order) and thin-plate (second-order) models consist 
strictly of terms constraining the relative values of proximal pixels, but no constraint 
of any kind on the value of any single pixel. Thus from the perspective of the regu- 
larizing constraints L, the overall image mean is irrelevant: 



$$ \|Lz\| = \|L(\mu \mathbf{1} + z)\| \tag{5.55} $$

for any constant, scalar mean $\mu$. In other words, the overall image mean lies in the
nullspace of $L$, thus clearly $L^T L$ is singular.

To remove the singularity of $L^T L$, or in those cases where we wish to assert a prior
mean, we need to introduce a constraint on an individual state element or, more
commonly, to introduce a constraint on each element. Given a prior mean $\underline{\mu}$ for each
element in $z$ and possible specified boundary conditions $\underline{\beta}$ from (5.53), we wish to
penalize

$$ \left\| \begin{bmatrix} L \\ \alpha I \end{bmatrix} z - \begin{bmatrix} \underline{\beta} \\ \alpha \underline{\mu} \end{bmatrix} \right\|, \tag{5.56} $$

leading to a weighted least-squares formulation

$$ \hat{z} = \left( C^T W C + L^T L + \alpha^2 I \right)^{-1} \left( C^T W m + L^T \underline{\beta} + \alpha^2 \underline{\mu} \right). \tag{5.57} $$

Here $\underline{\beta}$ enters as the target value of $Lz$, so its sign is opposite to that of the
boundary term added in (5.53).

The addition of diagonal constraints corresponds to increasing the value of the central
element in the kernel domain, as shown in Figure 5.14. As $\alpha$ increases, the variance
of each state element decreases, as does the spatial correlation length. Figure 5.15
plots this correlation length for the membrane and thin-plate priors in one, two, and
three dimensions.

Fig. 5.13. A detailed example of boundary conditions and first- / second-order constraints.
For simplicity, the constraints $L$ are illustrated for a one-dimensional example, so there are
no lexicographic effects. Clearly $L$ describes the constraints for a twelve-element state $z$, so
the sketch in the lower panel is essentially a cartoon, a continuous one-dimensional estimated
signal, consistent with the constraints and the nine measurements (open circles).

Fig. 5.14. To constrain the value of a pixel, either to limit its variance or to assert a prior mean,
we can choose to introduce a zeroth-order constraint, increasing the central element of the
kernel by some amount $\alpha^2$.

Fig. 5.15. The effect of asserting a prior mean: for the two common smoothness models under
consideration, as $\alpha^2$ in Figure 5.14 increases the process correlation length decreases.
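The singularity of $L^T L$ and its removal by a zeroth-order constraint can be checked on a small one-dimensional membrane model. In this sketch the Cholesky-based test `is_positive_definite`, the lattice size, and the value of `alpha2` are assumptions chosen for illustration.

```python
import math

def is_positive_definite(A):
    """Attempt a Cholesky factorization; it succeeds only if A is positive-definite."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False      # zero or negative pivot
                C[i][i] = math.sqrt(s)
            else:
                C[i][j] = s / C[j][j]
    return True

n = 5
# First-order (membrane) constraints with free boundaries: rows [-1 1]
L = [[-1.0 if k == i else 1.0 if k == i + 1 else 0.0 for k in range(n)]
     for i in range(n - 1)]
Q = [[sum(r[i] * r[j] for r in L) for j in range(n)] for i in range(n)]

# A constant image lies in the nullspace of L, hence of Q
Q_times_ones = [sum(row) for row in Q]

alpha2 = 0.1                          # zeroth-order constraint weight, alpha^2
Q_alpha = [[Q[i][j] + (alpha2 if i == j else 0.0) for j in range(n)]
           for i in range(n)]

print(Q_times_ones, is_positive_definite(Q), is_positive_definite(Q_alpha))
```

Any $\alpha^2 > 0$ suffices to make the system positive-definite; the choice of $\alpha$ then trades off how strongly the prior mean is asserted against the correlation length, as described above.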



5.6 Statistical Models 



The previous discussion focused on deterministic models for regularization. Because
these models are mostly heuristically chosen, and guaranteed to be positive-semidefinite,
the discussion was straightforward and mostly intuitive.

Example 5.1: Multidimensional Interpolation

We consider a two-dimensional estimation problem. We are given four measurements,
and choose to assert second-order (thin-plate) smoothness by asserting the
kernels $\mathcal{L}_{xx}, \mathcal{L}_{xy}, \mathcal{L}_{yy}$ of (5.45) at each pixel. Only the boundaries remain to be
specified; the following figures show the estimation results for the cases of free
and periodic boundaries:

[Figure panels: Free Boundaries (left), Periodic Boundaries (right).]

The surfaces show the resulting estimates, with the underlying images plotting the
estimation error variances, clearly identifying the four measurement locations by
their low variances (dark). Note the response in the error variances on the right-hand
side of the periodic surface to the measurements on the opposite (left) side
of the domain, compared to the absence of such a response for the free boundary.

A much richer example is constructed below, consisting of two periodic boundaries,
one free boundary, one zero boundary, a cut (zeroth-order discontinuity) and
a fold (first-order discontinuity):

[Figure panels: Domain Structure (left), Estimation Results (right).]

One can observe how the reduction or elimination of constraints by the cut
and fold leads to a corresponding increase (white) in the error variances. Because
the zero-boundary condition is a strong constraint, it greatly limits the variability
of the surface and thus leads to a low error variance (dark, back-right), in contrast
to the opposing free boundary with high error variance (light, front-left).

For the remainder of the chapter our focus shifts considerably to specifying statistical
prior models. We realize that the constraints $L$ implicitly specify the statistics of $z$, by
identifying $L^T L$ with $P^{-1}$. However, in the implicitness of the specification lies the
problem: given a random field $Z$, with known or sample statistics, it is not clear how
to practically formulate the associated $L$. Algebraically the formulation is simple:

$$ P \;\Longleftrightarrow\; P^{-1} = V D V^T \;\Longleftrightarrow\; L = D^{1/2} V^T. \tag{5.58} $$

That is, we solve for the matrix square root $L$, which is computationally infeasible for all
but the most trivially-sized problems. Furthermore, an arbitrarily chosen $P$ and the
derived $L$ may not permit compact or efficient representation and storage.

Even the direct specification of $P$ carries with it the burden of guaranteeing
positive-definiteness, so there remains an interest in implicit statistical methods which specify
the model through the covariance-inverse square root $L$, particularly so for nonstationary
problems (see Problem 5.5, for example).

The following four sections summarize methods for prior model specification, with
ideas developed in greater detail in later chapters as appropriate. The first two models
are explicit, specifying $P$; the latter two are implicit, specifying $L$ or $P^{-1}$.

• Section 5.6.1: Dense $P$, analytical positive-definite forms

• Section 5.6.2: Dense $P$, positive-definite forms for nonstationary fields

• Section 5.6.3: Sparse $\Gamma$, $A$, square root / dynamic models

• Section 5.6.4: Sparse $P^{-1}$, banded inverse-covariance models



5.6.1 Analytical Forms 

Consider setting up an estimation problem in $d$ dimensions, characterized by a stationary
correlation function

$$ E[z(x)\,z(x+\delta)] - E[z(x)]\,E[z(x+\delta)] = \sigma^2 f(r), \qquad r^2 = \sum_{i=1}^{d} \delta_i^2. \tag{5.59} $$

That is, $x$ and $\delta$ are a point and offset, respectively, in $d$-dimensional space, and $r$
measures the Euclidean length of offset $\delta$. So in two dimensions, the statistics are
characterized by a stationary correlation matrix

$$ E[z(0,0)\,z(x,y)] - E[z(0,0)]\,E[z(x,y)] = P(x,y). \tag{5.60} $$

There are only a few analytic forms for $f$ which are guaranteed to be positive-definite
[76, 162, 163, 248], plotted in Figure 5.16. Each is controlled by a single parameter
$\theta$, which determines the range or correlation-length of the function.

Fig. 5.16. Six common positive-definite forms, shown for two or three values of width parameter
$\theta$. The smoothness, numerical conditioning, and spatial extent of these are very different
from one to the next, as summarized in Table 5.2.



Exponential:

$$ f_{\text{Exp}}(r, \theta) = \exp\left( -\frac{|r|}{\theta} \right) \tag{5.61} $$

Very well conditioned numerically, even for highly-correlated (large $\theta$) domains,
however the sharpness of the peak at the origin causes the locations of sparse
measurements to induce corresponding peaks in the resulting estimates, normally
an undesirable artifact.



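Spherical:

$$ f_{\text{Sph}}(r, \theta) = \begin{cases} 1 - \dfrac{3}{2}\left(\dfrac{r}{\theta}\right) + \dfrac{1}{2}\left(\dfrac{r}{\theta}\right)^3 & r \le \theta \\[1ex] 0 & r > \theta \end{cases} \tag{5.62} $$

Similar conditioning and peak-sharpness properties to the exponential case, however
the spatial extent of the correlation is strictly limited to a finite range $r < \theta$.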


Gaussian:

$$ f_{\text{Gau}}(r, \theta) = \exp\left( -\frac{r^2}{\theta^2} \right) \tag{5.63} $$

Extremely smooth, but also leading to highly ill-conditioned covariance matrices.
Logistic:

$$ f_{\text{Log}}(r, \theta) = 1 - \frac{\theta r^2}{1 + \theta r^2} \tag{5.64} $$

Similar to the Gaussian case (smooth and ill-conditioned) but faster to compute.



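Bessel:

$$ f_{\text{Bes}}(r, \theta) = \frac{2}{\Gamma(\nu)} \left( \frac{r}{2\theta} \right)^{\nu} K_{\nu}\!\left( \frac{r}{\theta} \right) \tag{5.65} $$

A whole family of correlation functions, corresponding to the statistics of various
orders of Laplacians, the Bessel correlations are smooth and very well conditioned,
but of much greater computational complexity. The first-order case is
comparatively simple and worth noting:

$$ f_{\text{Bes}}(r, \theta) = \frac{r}{\theta}\, K_1\!\left( \frac{r}{\theta} \right). \tag{5.66} $$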

Separate from the preceding five functions there are two special cases, corresponding
to the limiting behaviour of the above functions as $\theta \to 0$ and $\theta \to \infty$:

Independent:

$$ f_{\text{Ind}}(r) = \begin{cases} 1 & r = 0 \\ 0 & r > 0 \end{cases} \tag{5.67} $$

As $\theta \to 0$ each element becomes uncorrelated with every other. Normally not a
useful model on its own, however see the discussion below regarding the nugget
effect.

Constant:

$$ f_{\text{Const}}(r) = 1 \tag{5.68} $$

As $\theta \to \infty$, all of the state elements become perfectly correlated; that is, the entire
random field is constant. The model is positive-semidefinite and not useful on its
own, however see the kriging discussion, below.

A summary of the properties of the positive-definite forms is shown in Table 5.2, 
with a short example in Figure 5.17. 
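The contrast in numerical conditioning between these forms can be illustrated on a small one-dimensional lattice. The sketch below assumes the common parameterizations $\exp(-|r|/\theta)$, the standard spherical model, and $\exp(-r^2/\theta^2)$; the function names, lattice size, and $\theta$ are arbitrary choices for illustration. The smallest Cholesky pivot serves as a rough, assumed proxy for conditioning: the closer it is to zero, the closer the covariance is to losing positive-definiteness numerically.

```python
import math

def f_exp(r, theta):
    return math.exp(-abs(r) / theta)

def f_sph(r, theta):
    u = abs(r) / theta
    return 1.0 - 1.5 * u + 0.5 * u ** 3 if u <= 1.0 else 0.0

def f_gauss(r, theta):
    return math.exp(-(r / theta) ** 2)

def smallest_pivot(P):
    """Smallest Cholesky pivot of P, or None if the factorization breaks down."""
    n = len(P)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = P[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return None
                C[i][i] = math.sqrt(s)
            else:
                C[i][j] = s / C[j][j]
    return min(C[i][i] for i in range(n))

n, theta = 8, 2.0
pivots = {}
for name, f in (("exponential", f_exp), ("spherical", f_sph), ("Gaussian", f_gauss)):
    # Stationary covariance on a 1-D lattice: P_ij depends only on |i - j|
    P = [[f(i - j, theta) for j in range(n)] for i in range(n)]
    pivots[name] = smallest_pivot(P)

for name, p in pivots.items():
    print(name, p)
```

All three factorizations succeed at this modest size, but the Gaussian pivot is markedly smaller than the other two, and it shrinks further as $\theta$ or the lattice size grows.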

Finally, because the positive linear combination of positive-definite matrices is itself
positive-definite,

$$ A, B > 0 \;\Longrightarrow\; \alpha A + \beta B > 0 \quad \text{for } \alpha, \beta > 0, \tag{5.69} $$

any positive linear combination of the above correlation functions remains positive-definite.
Of particular interest is the common form

$$ f_{\text{Nugget}}(r) = \alpha f_{\text{Ind}}(r) + \beta f_*(r) \tag{5.70} $$

for any valid correlation $f_*$, which models the so-called "nugget" effect of the kriging
literature [75,76] (see Section 2.5.3), in which there is a certain degree of instantaneous
decorrelation (controlled by $\alpha$) between a random element and any other,
regardless of how close. That is, the random field contains a white component.

Table 5.2. A summary comparison of qualitative properties of the analytical positive-definite
forms. None of these can be said to be objectively superior to the others.

Fig. 5.17. Three illustrations of spatial estimation based on analytical forms, based on the
scanning pattern of satellite measurements (as in Figure 1.3 on page 5). The Gaussian results
are numerically unstable due to ill-conditioning; the spherical results suffer from a local
correlation; the exponential results are reasonably credible, however in general the exponential is
like a first-order (membrane) prior and provides insufficient smoothing.

Of interest is also the other extreme, in which a constant is added to the correlation,

$$ f_{\text{Weak Prior}}(r) = \alpha f_{\text{Const}}(r) + \beta f_*(r) \tag{5.71} $$

for any valid correlation $f_*$. It is clear that the prior variance

$$ \operatorname{var}(z(x)) = \sigma^2 f_{\text{Weak Prior}}(0) = \sigma^2 (\alpha + \beta) \tag{5.72} $$

increases as $\alpha$ increases, meaning that the prior mean is asserted ever more weakly.
On the other hand, the statistics of pixel differences

$$ \operatorname{var}(z(a) - z(b)) = \sigma^2 \left[ 2 f_{\text{Weak Prior}}(0) - 2 f_{\text{Weak Prior}}(|a - b|) \right] = 2 \beta \sigma^2 \left[ f_*(0) - f_*(|a - b|) \right] \tag{5.73} $$

are independent of $\alpha$. Thus our spatial statistics are left untouched; only the prior
mean is increasingly ignored. As $\alpha \to \infty$, the solution to the estimation problem
becomes equivalent to kriging, however the poor numerical conditioning associated
with large $\alpha$ makes such an approach impractical.
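The independence of the difference statistics from $\alpha$ is easily confirmed numerically. In this sketch the exponential choice for $f_*$, its correlation length, and the pixel positions are assumptions for illustration; any valid correlation could be substituted.

```python
import math

def f_star(r, theta=2.0):
    """An arbitrary valid correlation (exponential), chosen only for illustration."""
    return math.exp(-abs(r) / theta)

def f_weak_prior(r, alpha, beta):
    """A constant correlation, weighted by alpha, added to beta * f_star."""
    return alpha * 1.0 + beta * f_star(r)

sigma2, beta = 1.0, 1.0
a, b = 3, 7                           # two pixel positions on a 1-D lattice

point_var, diff_var = [], []
for alpha in (0.0, 10.0, 1000.0):
    # Prior variance of a single pixel: grows with alpha
    point_var.append(sigma2 * f_weak_prior(0, alpha, beta))
    # Variance of a pixel difference: alpha cancels
    diff_var.append(sigma2 * (2 * f_weak_prior(0, alpha, beta)
                              - 2 * f_weak_prior(a - b, alpha, beta)))

print(point_var)
print(diff_var)
```

The difference variances agree only up to floating-point rounding, because each is computed as the small difference of two large numbers when $\alpha$ is large; this is a mild instance of the conditioning problem noted above.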

The key problem in directly modelling the covariance of random fields is that the
above correlation functions do not generalize to the nonstationary case. That is, using
any of the above correlations with a space-varying $\theta$ normally leads to covariances
which fail to be positive-semidefinite.



5.6.2 Analytical Forms and Nonstationary Fields 

Arbitrary spatial variations in $\theta$ are not possible, since the resulting covariances fail to
be positive-definite; however there are certain spatial variations which are permitted.

Independent Blocks 

If the spatial nonstationarities occur in disjoint, independent regions, each of which
is stationary, then the overall covariance is positive-definite. Specifically, if field $Z$
on lattice $\Omega$ is divided into regions $\{R_1, \ldots, R_q\}$ such that

$$ R_i \cap R_j = \emptyset \;\; (i \ne j), \qquad \bigcup_i R_i = \Omega, \tag{5.74} $$

then if we lexicographically order by region, the resulting field covariance is
block-diagonal:

$$ z = \begin{bmatrix} [Z(R_1)]_: \\ \vdots \\ [Z(R_q)]_: \end{bmatrix} \sim \begin{bmatrix} P_1 & & \\ & \ddots & \\ & & P_q \end{bmatrix}. \tag{5.75} $$

If each of the blocks is positive-definite, then so is the block-diagonal matrix. This
approach is, however, of very limited interest, since most images and random fields
are not piecewise-independent.

Spatially Varying Variance 

If the nonstationarities are confined to spatial variations in the random field variances,
and if we select a positive-definite, stationary correlation-coefficient structure
$f()$, then the model remains positive-definite. That is, if we define a stationary field

$$ z^{(s)} \sim P^{(s)}, \quad \text{where } P^{(s)}_{ij} = f(|i - j|), \tag{5.76} $$

and then define a nonstationary field $z^{(ns)}$, where each element of $z^{(ns)}$ is a rescaled
version of $z^{(s)}$:

$$ z^{(ns)} = \operatorname{Diag}(\sigma)\, z^{(s)} = D z^{(s)} \sim D P^{(s)} D^T > 0. \tag{5.77} $$

The resulting covariance is thus made up of nonstationary (variance) and stationary
(correlation) components:

$$ E[z(x)\,z(x+\delta)] - E[z(x)]\,E[z(x+\delta)] = \underbrace{\sigma_x\, \sigma_{x+\delta}}_{\text{Nonstationary}} \cdot \underbrace{f(\delta)}_{\text{Stationary}}. \tag{5.78} $$

Alternately, we can explicitly write the covariance of the field as

$$ P = \left( \sigma \sigma^T \right) \odot P^{(s)}, \tag{5.79} $$

where we see $P$ expressed as a stationary structure $P^{(s)}$, modulated by nonstationary
standard deviations.
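The equivalence of the rescaled form $D P D^T$ and the elementwise form $(\sigma\sigma^T) \odot P$, together with the preservation of positive-definiteness, can be verified on a small example. The exponential correlation, the particular standard deviations, and the helper names below are assumptions for illustration.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def is_positive_definite(A):
    """Attempt a Cholesky factorization; it succeeds only if A is positive-definite."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                C[i][i] = math.sqrt(s)
            else:
                C[i][j] = s / C[j][j]
    return True

n = 5
sigma = [1.0, 2.0, 0.5, 4.0, 1.5]     # space-varying standard deviations

# Stationary correlation structure (exponential, correlation length 2)
P_s = [[math.exp(-abs(i - j) / 2.0) for j in range(n)] for i in range(n)]

# Rescaling form: D P D^T, with D = Diag(sigma) (diagonal, so D^T = D)
D = [[sigma[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
P_rescaled = matmul(matmul(D, P_s), D)

# Elementwise form: (sigma sigma^T) multiplied entry by entry with P
P_elementwise = [[sigma[i] * sigma[j] * P_s[i][j] for j in range(n)]
                 for i in range(n)]

max_gap = max(abs(P_rescaled[i][j] - P_elementwise[i][j])
              for i in range(n) for j in range(n))
print(max_gap, is_positive_definite(P_rescaled))
```

The diagonal of the result holds the variances $\sigma_i^2$, while the correlation coefficients between elements are exactly those of the stationary structure.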

Nonstationary Mixtures of Stationary Fields 

In many situations the correlation structure $f$ is itself nonstationary, in which case
modelling is somewhat more difficult. Such situations commonly occur in recursive
dynamic estimation in which measurements are assimilated in multiple steps over
time.

Given a correlation structure $f(r, \theta)$, if $\theta$ varies with spatial location, the resulting
covariance is almost certainly not positive-definite. However, it is possible to use
a nonstationary mixture of stationary models [191], an approach most effective if
the modelled correlation structure has a single form (e.g., exponential), but with a
space-varying correlation length $\theta$. That is, for each unknown element $z_j$, suppose
we have an associated idealized parameter setting $\theta_j$. The idea is to generate $q$ sets
of estimates $\{\hat{z}(m|\Theta_i)\}$ and possibly error covariances $P(m|\Theta_i)$, each based on a
stationary positive-definite model $f$ characterized by one of $\Theta_1, \ldots, \Theta_q$.

The desired nonstationary estimates and error variances are then found as a nonstationary
linear combination of the stationary ones; that is,

$$ \hat{z}_j = \sum_i \alpha_i(\theta_j)\, \hat{z}_j(m|\Theta_i) \tag{5.80} $$

$$ P_{jj} = \sum_i \beta_i(\theta_j)\, P_{jj}(m|\Theta_i). \tag{5.81} $$

The interpolating weights $\alpha_i, \beta_i$ satisfy

$$ \sum_i \alpha_i(\theta) = 1 \qquad \sum_i \beta_i(\theta) = 1 \qquad \forall \theta \tag{5.82} $$

to prevent biasing the estimates, and

$$ \alpha_i(\Theta_k) = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases} \qquad \beta_i(\Theta_k) = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases} \tag{5.83} $$

to require the exact solution to emerge when the desired $\theta$ equals one of the $\{\Theta_i\}$.
The weights themselves can be solved by least squares to minimize the estimation error
[191]. A brief illustration of this method can be seen in Application 4 on page 122.



5.6.3 Recursive / Dynamic Models 

The previous two sections examined the explicit construction of a covariance P, 
which has the advantage of specifying an unambiguous statistical model, but the 
disadvantage of being a dense matrix for which positive-definiteness must be guar- 
anteed. 

However, in the same way that the constraint matrix $L$, the matrix square root of
the system matrix $Q$, has no positivity requirements, the same is true of the matrix
square root for the field covariance $P$ or correlation structure $\mathcal{P}$. In particular, we
could specify a random field $z$ as

$$ z = \left[ \Gamma * W \right]_: \;\Longrightarrow\; \mathcal{P} = \Gamma * \Gamma^T \tag{5.84} $$

$$ z = \Gamma w \;\Longrightarrow\; P = \Gamma \Gamma^T, \tag{5.85} $$

where $w$ and $W$ are white unit-variance random fields.

Such a representation indeed frees us from positive-definiteness requirements, however
$\Gamma$ remains large and dense and, in contrast to $P$, is difficult to interpret or to
specify. Computing $\Gamma$ directly from $P$ may be impractical for large problems, depending
on the sparsity of $P$ (see Section 9.1.2), because of the computational complexity
associated with finding matrix square roots.

A more efficient alternative is a recursive-dynamic model for the square root. Given
a dynamic model

$$ z(t+1) = A z(t) + B w(t), \tag{5.86} $$

or possibly its space-stationary variant,

$$ Z(t+1) = \mathcal{A} * Z(t) + \mathcal{B} * W(t), \tag{5.87} $$

plus an initial condition $\operatorname{cov}(z(0)) = P_0$, then the process covariance is implicitly
specified and guaranteed to be positive-definite.

Like $L$, $A$ and $B$ are typically sparse and subject to no other conditions.⁴ Furthermore
the Kalman smoother can be used to generate estimates $\hat{z}(t|T)$ based on all of the
measurements scattered throughout the dynamic process.

The challenge, clearly, is how to determine a partitioning of the problem such that
the partitions obey a dynamic relationship. That is, given a multidimensional random
field $Z$, how can we define partitions $z(t) = S_t Z_:$ such that the resulting dynamics
on $z(t)$ capture the statistics of $Z$?

The clear benefit of such a scheme, of course, is that the individual states $z(t)$ are
much smaller than the entire process $Z_:$, thus the size of dynamic $A$ is much, much
smaller than the single constraint matrix $L$ which it replaces.
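To illustrate how a dynamic model of the form (5.86) implicitly specifies valid covariances, the following sketch propagates the state covariance with the standard recursion $P(t+1) = A P(t) A^T + B B^T$, which follows from (5.86) when $w(t)$ is white, unit-variance, and independent of $z(t)$. The particular $A$, $B$, and $P_0$ are arbitrary assumptions; $A$ is deliberately chosen with an unstable mode to show that stability is not needed for positive-definiteness over a finite number of steps.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def is_positive_definite(A):
    """Attempt a Cholesky factorization; it succeeds only if A is positive-definite."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                C[i][i] = math.sqrt(s)
            else:
                C[i][j] = s / C[j][j]
    return True

A = [[1.1, 0.2], [0.0, 0.9]]          # sparse dynamics; note the unstable mode 1.1
B = [[0.5, 0.0], [0.0, 0.5]]          # process-noise gain
P = [[1.0, 0.0], [0.0, 1.0]]          # initial condition P_0

history = []
for t in range(6):
    history.append(is_positive_definite(P))
    # Covariance of z(t+1) = A z(t) + B w(t) for white, unit-variance w(t)
    P = add(matmul(matmul(A, P), transpose(A)), matmul(B, transpose(B)))

print(history)
```

No positivity check on $A$ or $B$ was required at any point; the structure of the recursion guarantees a valid covariance at every step.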

The answer to the partitioning question can depend subtly on the statistics of Z, 
and can most definitively be answered for the broad class of Markov random fields, 
discussed in Chapter 6. 

A number of partitioning schemes are examined in this book, in particular a partitioning
into adjacent groups (e.g., the partitioning of an image into rows or columns)
in the marching methods of Section 10.1, and the hierarchical partitioning into nested
subsets in the multiscale methods of Section 10.3.



5.6.4 Banded Inverse-Covariances

Motivated by the sparse-matrix methods described in Section 5.3, it is tempting to
consider representing covariance matrices by a sparse, banded representation. However,
a $b$-banded covariance $P$ implies that any state element is correlated with only
$b - 1$ other elements, and completely decorrelated from the entire remainder of the
field. For example, a standard five-banded covariance for a two-dimensional random
field would imply that any pixel is correlated only with its immediate four
neighbours. Thus for any reasonably sized $b$, the spatial extent of correlation will
be very modest, much shorter than would arise in realistic contexts. For those rare
locally-correlated problems it is much simpler to use a local estimator, so banded
covariances are of little interest.

[Figure 5.18 panels: (a) 5-Banded Matrix; (b) 5-Banded Matrix Inverse.]

Fig. 5.18. The inverse of a sparse matrix is normally dense, with enormously complicated
structure.

⁴ The reader may wonder about stability requirements for $A$; however for a domain of finite
size stability is not an issue.

Furthermore, to take a given covariance and to force it to be banded, by setting all 
values outside of the bands to zero, almost certainly results in a matrix which fails to 
be positive-definite. 

Banded inverse covariances are a completely different matter, however. In particular,
because the inverse of a sparse matrix is normally dense, a sparse $P^{-1}$ can imply
a dense $P$, that is, with all elements correlated with one another, as is illustrated
in Figure 5.18. Indeed, even a five-banded $P^{-1}$ can model arbitrarily strong correlations
between arbitrarily distant state elements. Additional bands do not necessarily
lead to stronger correlations, rather to more interesting variations, patterns, and textures.
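A one-dimensional analogue shows the effect clearly. In this sketch, where the function names and the sizes are assumptions for illustration, a tridiagonal $P^{-1}$ built from the first-order kernel $[-1 \;\; 2 \;\; -1]$ plus a diagonal term $\alpha^2$ is inverted by Gauss-Jordan elimination. The resulting $P$ is fully dense, and the correlation between the two end elements is controlled by $\alpha^2$ rather than by the number of bands.

```python
import math

def banded_inverse_covariance(n, alpha2):
    """Tridiagonal P^{-1}: kernel [-1, 2, -1] plus alpha^2 on the diagonal."""
    return [[2.0 + alpha2 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]

def invert(A):
    """Gauss-Jordan inverse; no pivoting, which suffices for positive-definite A."""
    n = len(A)
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = M[c][c]
        M[c] = [v / p for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def end_to_end_correlation(n, alpha2):
    """Correlation coefficient between the first and last state elements."""
    P = invert(banded_inverse_covariance(n, alpha2))
    return P[0][n - 1] / math.sqrt(P[0][0] * P[n - 1][n - 1])

n = 6
P = invert(banded_inverse_covariance(n, 0.01))
smallest_entry = min(min(row) for row in P)   # positive: every pair is correlated

weak = end_to_end_correlation(n, 1.0)         # large alpha^2: short correlation length
strong = end_to_end_correlation(n, 0.01)      # small alpha^2: long correlation length
print(smallest_entry, weak, strong)
```

The same three-band structure thus yields either nearly independent or strongly correlated end elements, depending only on the diagonal weight.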

However, the benefits of sparse banded inverses go far beyond the convenience of 
simplified covariance storage. 

Indeed, the singular key benefit of most⁵ sparse inverses is that they imply a certain
decoupling of the random process, meaning that the original field can, relatively
easily, be divided into smaller pieces, a key desirable property when considering
domain decomposition methods. This large, important class of statistical models is
known as Markov random fields, the topic of Chapter 6.

Even outside of an explicit consideration of sparse inverse-covariances this class
of models is significant. The recursive-dynamic models of Section 5.6.3 possess a
Markov decomposition property, by construction, and therefore also have sparse inverses,
although this inverse is normally not computed and stored explicitly. Similarly
the deterministic models of Section 5.5 are characterized by sparse $L$ or $L^T L$,
which we understand to correspond with $P^{-1}$, and are therefore also Markov.

⁵ Not all sparse inverses have simple or useful decoupling properties. Normally the process
is reversed: we specify the nature or degree of decoupling, which determines the class of
sparse inverses to be considered.



5.7 Model Determination 



The problem of inferring a model is known as system identification [212], a complex 
task whose details and subtleties are outside the scope of this text. 

In many cases, if the multidimensional measured system is a real, physical one, then the behaviour of the system will be described by a specific mathematical or physical formulation, whether an elliptic partial differential equation with boundary conditions, or an ordinary differential equation over time, which may then be discretized.

However, even for physical systems with a known mathematical model, there are 
circumstances in which the mathematical model fails to lead to a meaningful prior: 

• The mathematical description may live at a scale far finer than the intended dis- 
cretization of the problem. 

• The dimensionality of the mathematics may be different from the intended estima- 
tion problem. For example, the three-dimensional behaviour of water in the ocean 
is governed by the Navier-Stokes equation, which is of limited use in forming a 
prior of the two-dimensional ocean surface. 

In these and many other cases we will have training data, but not a physical basis for 
knowing a model, although our understanding of the problem may allow us to assert 
whether the model is stationary, periodic at the boundaries, or isotropic, for example. 

Under modelling we normally understand the inference of a model with relatively few parameters for representation. A very large model is much more likely to lead to overfitting, although methods of cross-validation (Section 2.4.1) can be used to test for that. In practice, for large multidimensional problems we would prefer to learn a parametrized model, meaning a statistical model $(P(\theta), C(\theta), R(\theta))$ with some number of unknown parameters $\theta$.

There are variations in terms of what is known: 

• Ground truth data Z for the state Z, or 

• Some number of measurements sets M, 
and what is not known:

• The prior model P or regularization constraint L, and/or 

• The forward problem C, and/or 

• The statistics of the measurements R. 

It is quite common for the physics of the problem to dictate C and R, leaving the prior 
model to be inferred. In many cases a Markov model is chosen, as just discussed in 
Section 5.6.4, and for which specialized methods of model inference are discussed 
in greater detail in Chapter 6, in Section 6.6. 

Outside of the Markov case, we need to estimate from the given data the parameters $\theta$, parameterizing a prior model

$Z \sim \mathcal{N}(0, P(\theta))$, $Z \sim \mathcal{N}(0, \mathcal{P}(\theta))$, or $Z \sim L(\theta)$ (5.88)

and so on, as appropriate. For example, all of the analytical positive-definite forms in Section 5.6.1 are a function of a single scalar parameter $\theta$. We rarely have a prior model for $\theta$ itself, so the unknown parameters are estimated using maximum likelihood (2.93):

$\hat{\theta} = \arg\max_{\theta}\, p(Z|\theta)$ or $\hat{\theta} = \arg\max_{\theta}\, p(M|\theta)$ (5.89)

depending on whether ground truth Z or measurements M are given [248]. 
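
As a concrete illustration of the ground-truth case in (5.89), the following sketch performs a brute-force maximum-likelihood search over a single scalar parameter. It assumes a zero-mean, unit-variance, one-dimensional Gaussian process with covariance $\exp(-|i-j|/\theta)$; the function names, grid of candidate values, and sample size are hypothetical choices made for illustration.

```python
import numpy as np

def exp_cov(n, theta):
    """Assumed exponential covariance exp(-|i-j|/theta) on a 1-D grid."""
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / theta)

def log_likelihood(z, theta):
    """Zero-mean Gaussian log-likelihood log p(z | theta)."""
    P = exp_cov(len(z), theta)
    _, logdet = np.linalg.slogdet(P)
    quad = z @ np.linalg.solve(P, z)
    return -0.5 * (quad + logdet + len(z) * np.log(2 * np.pi))

# Synthesize one ground-truth sample with a known parameter
rng = np.random.default_rng(0)
n, theta_true = 200, 10.0
z = np.linalg.cholesky(exp_cov(n, theta_true)) @ rng.standard_normal(n)

# Evaluate the likelihood on a grid of candidates and pick the maximizer
thetas = np.linspace(1.0, 40.0, 79)
ll = np.array([log_likelihood(z, t) for t in thetas])
theta_hat = thetas[np.argmax(ll)]
print("true theta:", theta_true, " estimated theta:", theta_hat)
```

A grid search is adequate only for one or two parameters; for higher-dimensional $\theta$ an iterative optimizer would normally replace it.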

In many cases, especially when measurements are given, it is exceptionally difficult to even write down $p(M|\theta)$, essentially because the relationship between $\theta$ and $M$ goes via the (unknown) $Z$. In such cases the expectation-maximization algorithm [25, 85, 275], also discussed in Section 7.5, is widely used for parameter estimation.



Example 5.2: Model Inference 



We are given a sample Z of a random field Z and its associated correlation: 




[Figure: the given sample and its correlation as a function of offset.]




In principle we could attempt to infer a model directly from the correlation; however, the way in which the model parameters $\theta$ relate to the correlation may not be obvious. Suppose we know the form of the model, here the exponential correlation $f_{\exp}$ of (5.61) with parameter $\theta$; then we can compute the likelihood $p(Z|\theta)$, evaluated at the given sample, as a function of $\theta$ and select $\hat{\theta}$ to maximize it.

Because the three samples $Z_i$ are different, the log-likelihood curves and their respective maxima will vary somewhat from sample to sample (left panel). If, furthermore, the sample $M = Z + V$ is noisy (right panel), then the accuracy of the estimate decreases at greater noise levels.

It is more likely that our model uncertainties would include correlation, variance, 
and measurement noise variance. These parameters can, in this simple case, be 
inferred from an autocorrelation: 




By extrapolating a process variance of ≈ 5.1, the measurement noise variance appears as the increased height of ≈ 1.1 at zero-lag.

Instead, we could have estimated all three parameters in $\theta$ using maximum likelihood, an optimization over a three-dimensional space, for which an iterative method such as EM would typically be used.
Fig. 5.19. We have seen eight forms of representation, left, where the forms differ based on
the choice of three underlying aspects: representing a model or its square root, determinis- 
tic constraints versus statistical modelling, and a full/dense matrix versus a sparse matrix or 
convolutional kernel. 



5.8 Choice of Representation 



This chapter has developed a total of eight forms of representation: whether con- 
straints or statistics, square root or squared, and whether a sparse matrix form or a 
stationary kernel, as illustrated in Figure 5.19. 

The last of these three questions is perhaps of lesser significance, to the extent that 
kernels are just an implicit, highly sparse representation of a matrix. If the problem is 
highly nonstationary, then a regular sparse matrix approach is simpler. Alternatively, 
if the problem is mostly stationary, then it may be easier to store a kernel and to keep 
track of nonstationary exceptions (such as at boundaries and discontinuities). 

The former question is more significant. First, whereas all L, C, A, T are valid, the direct specification of $L^T L$, Q, P must guarantee positive-semidefiniteness. Although this requirement is easily satisfied for stationary problems, for nonstationary problems definiteness may be much more subtle.

In general a covariance $P$ will be dense, and the detailed discussion of sparse statistical models is deferred to Chapter 6. For a constraints-based model, if the number of diagonal bands in $L$, or equivalently the number of nonzero entries in the kernel $\mathcal{L}$, is $b$, then $Q = L^T L$ may have up to $b^2$ bands. Therefore for $z \in \mathbb{R}^n$, the product $x = Qz$ has complexity $O(b^2 n)$, whereas separately computing

$y = Lz, \qquad x = L^T y$ (5.90)

has considerably reduced complexity $O(2bn)$, but at double the storage complexity, since $y$ needs to be stored in memory.
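
The two ways of forming the product can be compared directly. The sketch below is a small one-dimensional illustration, assuming first-order difference constraints stored as dense arrays for readability; a practical implementation would use a sparse or kernel representation.

```python
import numpy as np

n = 50
# First-order difference constraints on a 1-D signal: (n-1) x n, two bands
L = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
Q = L.T @ L                      # squared constraints, with wider bands

rng = np.random.default_rng(1)
z = rng.standard_normal(n)

x_direct = Q @ z                 # single product with the wider-banded Q
y = L @ z                        # intermediate vector that must be stored
x_two_step = L.T @ y             # second product, with L transposed

bands_L = int(np.count_nonzero(L[n // 2]))
bands_Q = int(np.count_nonzero(Q[n // 2]))
print("nonzeros per row of L:", bands_L, " per row of Q:", bands_Q)
print("results agree:", np.allclose(x_direct, x_two_step))
```

Both routes give the same vector; the trade-off is between the extra bands in $Q$ and the extra storage for the intermediate $y$.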

In those cases where the computational complexity is not prohibitive, the reduced storage demands and the simpler, more homogeneous structure of $Q$ versus the stacked sets of constraints in $L$ (as in (5.47)) may lead to a preference for $Q$, $\mathcal{Q}$ over $L$, $\mathcal{L}$.

Fig. 5.20. A domain-decomposition approach to phase unwrapping [53]: Given an interferogram (left) we can divide it into regions and unwrap each region individually (center), where then the regions can be unwrapped against each other (right) to form the final image.



Application 5: Synthetic Aperture Radar Interferometry [53] 



A fascinating problem in image processing and remote sensing is that of radar inter- 
ferometry [134, 138]. 

The complex radar return at a pixel has a phase proportional to the target distance 
at that pixel. It is possible to take two radar images, either from one satellite (e.g., 
Radarsat) from two precisely-defined orbital positions or, preferably, from a pair of 
satellites (e.g., ERS-1/2) having a precise orbital offset. It is possible to design the 
orbital positions of the imaging satellites such that the complex phase difference be- 
tween the two images is proportional to height, allowing detailed topographic maps 
to be made from space. An example of a resulting phase-difference image, known as 
an interferogram, is shown in Figure 5.20. 

We would like to recover the absolute phase difference $\phi$; however, the measured phase difference $m = c(\phi)$ is wrapped, meaning that $-\pi < c(\phi) < \pi$. The solution to the unwrapping problem therefore differs from the wrapped measurement by some integer multiple of $2\pi$:

$\phi = c(\phi) + 2\pi z.$ (5.91)
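
The relationship (5.91) is easy to see in one dimension. The following sketch assumes a noise-free, smoothly increasing phase profile and uses NumPy's one-dimensional `np.unwrap`; the `wrap` helper is an illustrative stand-in for $c(\cdot)$ and, as a convention of this sketch, maps onto the half-open interval $[-\pi, \pi)$. It is not the domain-decomposition method of [53].

```python
import numpy as np

def wrap(phi):
    """Wrap a phase into the principal interval [-pi, pi)."""
    return (phi + np.pi) % (2 * np.pi) - np.pi

phi = np.linspace(0.0, 6 * np.pi, 200)   # hypothetical smooth absolute phase
m = wrap(phi)                            # wrapped measurement c(phi)
k = np.round((phi - m) / (2 * np.pi))    # the integer field z of (5.91)
phi_rec = np.unwrap(m)                   # 1-D unwrapping along the profile

print("largest integer multiple:", int(k.max()))
print("unwrapped profile matches:", np.allclose(phi_rec, phi))
```

In the noise-free, densely sampled case the integer multiples are recovered exactly; the difficulty in practice comes from noise, low coherence, and the two-dimensional nature of the problem.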



It is possible to estimate $z$ by counting the interferometric fringes, which are quite obvious in Figure 5.20; however, real data are subject to noise and poor radar coherence, in which case the fringes are far less well defined. Integer (network-flow) optimization methods have been used with considerable success [72]; however, there are issues of computational complexity in solving problems on a very large domain.

Fig. 5.21. The domain-decomposition phase unwrapping from Figure 5.20 applied to ERS-1/2 interferometric data from Mt. Etna [53]. The phase data are ignored in areas with low radar coherence, leading to the gaps in the regions and reconstruction images, and only those gaps are interpolated in the final image.

Motivated by the domain-decomposition ideas illustrated in Figures 5.2 and 5.5, Figure 5.20 shows the original interferogram being divided [53] into modestly-sized regions, with each region unwrapped individually, followed by unwrapping the regions against each other.

The approach has been applied to large problems and real data, one example of which 
is illustrated in Figure 5.21 for ERS data. 
For Further Study

There is relatively little material on the subject of multidimensional modelling; in- 
deed, it is precisely this gap which motivated the writing of this text. 

Two classic papers by Terzopoulos [303, 304] examine methods for solving inverse 
problems given stationary kernels, with exceptions for folds and discontinuities, very 
much along the lines of aspects of this chapter. A related paper is that by Szeliski 
[299], although that paper is more easily read after Chapter 8. 



Sample Problems 

Problem 5.1: Membrane Kernel 

Consider a $5 \times 5$ random field $Z$. Create the first-order, free-boundary $20 \times 25$ constraint matrices $L_x, L_y$, leading to the overall first-order constraint

$L_1 = \begin{bmatrix} L_x \\ L_y \end{bmatrix}$

Compute the squared constraints matrix $Q = L_1^T L_1$, and lexicographically unwrap the middle row of $Q$ to see the membrane kernel; that is, compute

$\left[ Q_{13,\,1 \ldots 25} \right]_{5 \times 5}$

Try unwrapping other rows of Q, such as the first and the last, and comment on 
what you observe. 

Problem 5.2: Membrane Kernel 

Repeat Problem 5.1 for the second-order thin-plate case. That is, create a constraints matrix

$L_2 = \begin{bmatrix} L_{xx} \\ L_{yy} \\ \sqrt{2}\, L_{xy} \end{bmatrix}$

and compute $Q = L_2^T L_2$. Unwrap the middle row of $Q$, along with a few other rows, and state your observations.

Problem 5.3: Matrix Truncation 

In most circumstances, taking a covariance P and keeping only a certain number 
of diagonal bands leaves you with a matrix which is not positive-definite, and 
which is therefore invalid as an approximate prior model for estimation. Let 
$P_b = \text{bands}(P, b)$

be a matrix which copies the $b$ bands on either side of the principal diagonal, and sets the remaining entries in $P_b$ to zero.

The positive-definiteness of $P_b$ can be determined from its eigendecomposition; investigate the positive-definiteness of $P_b(\sigma)$ as a function of $b$ and $\sigma$ for two different covariances:

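(a) Gaussian: Let $P$ be a $20 \times 20$ covariance matrix, corresponding to a one-dimensional process of length twenty, such that

$P_{i,j}(\sigma) = \exp\left(-\frac{(i-j)^2}{\sigma^2}\right), \qquad \left[P_b(\sigma)\right]_{i,j} = \begin{cases} \exp\left(-\frac{(i-j)^2}{\sigma^2}\right) & |i-j| \le b \\ 0 & |i-j| > b \end{cases}$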



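(b) Exponential: Let $P$ be a $20 \times 20$ covariance matrix, corresponding to a one-dimensional process of length twenty, such that

$P_{i,j}(\sigma) = \exp\left(-\frac{|i-j|}{\sigma}\right), \qquad \left[P_b(\sigma)\right]_{i,j} = \begin{cases} \exp\left(-\frac{|i-j|}{\sigma}\right) & |i-j| \le b \\ 0 & |i-j| > b \end{cases}$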

A look at Problem 2.4 may help to explain what is happening here. 

Problem 5.4: Two-Dimensional Interpolation 

Suppose we have a 20 x 20 two-dimensional grid Z. Suppose also that we have 
seven measurements of individual points Zij : 

i 5 15 5 15 3 10 17 
j 2 2 10 10 18 17 18 
m 15 10 5 20 10 20 

Thus in the measurement model m = Cz + v, matrix C is a 7 x 400 array. The 
measurement errors are independent and have a variance of 9, thus 

$\text{cov}(v) = R = 9 \cdot I.$

We define four cases of regularizing constraints: 

1. First-order constraints along i and j, free boundaries 

2. Second-order constraints along i and j, periodic boundaries 

3. First-order along i, second-order along j, a discontinuity (cut) from (10.5,1) to (10.5,10), periodic boundaries at i = 1, 20, free boundaries at j = 1, 20

4. Second-order along i and j, a fold from (10,10) to (10,20), periodic bound- 
aries at j = 1, 20, free boundary at i = 1, zero boundary at i = 20 
For each of these four cases, do the following:

(a) Compute the constraints matrix $L$ and then compute $Q = L^T L$. Plot the structure of $Q$ (using spy(Q) in MATLAB), and interpret.

(b) Choose one or two rows from Q and lexicographically unwrap them to see 
the associated kernels; interpret the results. 

(c) Select an appropriate value of $\lambda$ (by trial-and-error or by cross-validation), compute estimates $\hat{z}$, and state your observations.

Problem 5.5: Open-Ended Problem — Nonstationary Modelling 

Suppose we wish to consider a global remote-sensing problem, such as the Geoid
of Figure 5.12, the altimetric problem of Figure 1.3, or the ocean-temperature 
problem of Figure 4.1 and Application 8. 

The problem is very large-scale, nonstationary, with periodic boundaries. We 
wish to construct a prior model subject to the following: 

1. We wish to model the Earth on a rectangular lattice, gridded in latitude and
longitude, with lattice points one degree apart, with a latitude range from 60S 
to 60N. 

2. The smoothness constraint is second order over water, state elements over 
land should be removed. 

3. The west-east boundaries need to be periodic; the north-south boundaries 
should be free. 

4. The spatial correlation length will be stationary; however, because the lattice is gridded in lat-long, and lines of latitude converge at the poles, the lattice model is nonstationary. Specifically, the $L_{yy}$ terms are stationary (lines of longitude are parallel), however the $L_{xx}$ terms need to be scaled.

5. Clearly, the entire model needs to be positive-definite. 

Because the model is large and nonstationary, we want to specify the constraints 
L, and not the weight matrix Q. 

(a) Draw a sketch or explain in simple terms how to go about formulating L. 

(b) Implement code to generate the product Lz. 

(c) Optional: for those readers who have read or are familiar with the linear- 
systems and preconditioning methods of Chapter 9, generate estimates based 
on L given real or artificial data. 



Markov Random Fields 



Based on the modelling discussions of Chapter 5, the issues of computational and 
storage complexity for large problems have motivated an interest in sparse represen- 
tations, and also in those models which allow some sort of decoupling, or domain 
decomposition, to allow a hierarchical approach. As we shall see, both sparsity and 
domain decomposition are at the heart of all Markov processes, thus the topic of 
Markovianity is central to the modelling and processing on large domains. 

Fundamentally, all things Markov [127, 276] have a conditional independence or 
conditional decorrelation property. For example, given three pieces of a random field, 



$\begin{bmatrix} Z_A \\ Z_B \\ Z_C \end{bmatrix}$ (6.1)

then $B$ is a Markov separation of $A$ and $C$ if it contains all of the information needed by $A$ about $C$:

$p(Z_A | Z_B, Z_C) = p(Z_A | Z_B) \qquad p(Z_C | Z_B, Z_A) = p(Z_C | Z_B).$ (6.2)

That is, given B, knowing C tells us nothing more about A, nor knowing A any more 
about C. In other words, having determined B (somehow), we are free to consider 
A and C separately: the problem has been decoupled! 

This Markov property is a common phenomenon in familiar random processes, it 
conveniently extends to two- and higher-dimensional fields, and it leads to rather 
flexible models and algorithms. 

This chapter summarizes the properties of Markov models, first in one dimension and 
then in higher-dimensional cases, and discusses methods of Markov model inference. 
How to use the Markov decoupling properties and how to compute estimates based 
on Markov priors is the subject of Chapters 7 through 10. 
Fig. 6.1. The past and the future of one-dimensional Markov processes are conditionally separated by some information of the present. In continuous time, the required information is the state value, and some number of derivatives, at a single point in time $t_o$. In discrete time, the "present" consists of some number of consecutive samples.



6.1 One-Dimensional Markovianity 

A one-dimensional random process $z(t)$ is Markov (Figure 6.1) if the knowledge of the process at some point $z_o$ decouples the "past" $z_p$ and the "future" $z_f$:

$p(z_f | z_o, z_p) = p(z_f | z_o), \qquad p(z_p | z_o, z_f) = p(z_p | z_o).$ (6.3)

The specifics of what these terms really mean are dependent on the order of the process, and whether we are working with continuous-time or discrete-time processes. In continuous time, if a process is decoupled about a given point $t_o$,

$z_p = \{ z(t) \,|\, t < t_o \}$
$z_f = \{ z(t) \,|\, t > t_o \},$ (6.4)

where the information kept at $t_o$ is the process and its first $(n-1)$ derivatives,

$z_o = \left[ z(t_o), z^{(1)}(t_o), \ldots, z^{(n-1)}(t_o) \right],$ (6.5)

then $z$ is said to be nth-order Markov. Similarly, in the discrete-time case, $n$ successive process samples are needed to separate an nth-order Markov process:

$z_p = \{ z(t) \,|\, t \le t_o - n \}$
$z_f = \{ z(t) \,|\, t > t_o \}$ (6.6)
$z_o = \{ z(t_o - n + 1), z(t_o - n + 2), \ldots, z(t_o) \}.$

Clearly these two definitions for continuous-time and discrete-time are mutually compatible as the discretization interval goes to zero, since the first $n-1$ synthetic derivatives of a process can be computed from $n$ successive samples.

One-dimensional Markov processes are generally of limited interest in solving mul- 
tidimensional problems, so we focus on the generalization of these ideas to higher 
dimensions in Section 6.2. There are, however, two well-known classes of one- 
dimensional Markov processes, discussed below. 
6.1.1 Markov Chains

The special case of first-order, one-dimensional, discrete-time, discrete-state Markov processes is also known by the more familiar name of Markov chains, which were introduced in Section 4.5.1.

The first-order and discrete-time nature of the process implies that

$p(z(t+1) \,|\, z(t), z(t-1), z(t-2), \ldots) = p(z(t+1) \,|\, z(t))$ (6.7)

and the discrete-state nature of the process implies that, rather than a continuous probability density, we have a discrete set of state-to-state transition probabilities

$a_{ij} = \Pr\left( z(t+1) = s_i \in \Psi \,|\, z(t) = s_j \in \Psi \right).$ (6.8)

Although Markov chains provide a helpful intuitive connection, in practice they con- 
tribute relatively little to image modelling and estimation: 

• The number of states tends to be impossibly large, in practice. If the Markov chain represents a set of $n$ eight-bit grey-level pixels, then the number of states is $256^n$.

• The one-dimensionality and causality of Markov chains is too limiting to apply to 
multidimensional contexts. 

We do, however, return to Markov chains in looking at hidden Markov models in 
Chapter 7. 



6.1.2 Gauss-Markov Processes 

A random process is nth-order Gauss-Markov if it is nth-order Markov and if the 
elements of the process are jointly Gaussian. As an example of key importance, our 
canonical dynamic model (4.1) of Section 4.1 is, by construction, jointly Gaussian 
(based on its linearity and Gaussian driving noise) and, in fact, is also first-order 
Markov. Indeed, Markovianity is explicit in its recursive form: 

$z(t+1) = A(t) z(t) + B(t) w(t) \qquad w(t) \sim \mathcal{N}(0, I),$ (6.9)

which implies that given $z(t)$, the uncertainty remaining in $z(t+1)$ is due entirely to $w(t)$. The noise term $w(t)$ is uncorrelated with $z(t-1), z(t-2), \ldots$; therefore none of $z(t-1), z(t-2), \ldots$ have any additional information regarding $z(t+1)$, and thus the conditional probability density decomposes as

$p(z(t+1) \,|\, z(t), z(t-1), \ldots) = p(z(t+1) \,|\, z(t)).$ (6.10)

By extension, all such recursive-Gaussian models are Gauss-Markov, examples of 
which are listed below and plotted in Figure 6.2: 
[Figure 6.2 panels: White Noise (0th Order), Brownian Motion (1st Order), Autoregressive (2nd Order), Autoregressive (3rd Order).]


Fig. 6.2. The order of a Markov process counts the number of degrees of freedom (process 
value plus derivatives) needed at some point, such as the dashed line, to characterize the pro- 
cess at that point. The order is normally related to process smoothness and complexity. 



White noise: $x(t) = w(t)$ (0th-order Gauss-Markov)

Brownian motion: $x(t) = x(t-1) + w(t)$ (1st-order Gauss-Markov)

Autoregressive: $x(t) = \sum_{i=1}^{n} a_i\, x(t-i) + w(t)$ (nth-order Gauss-Markov)

(6.11)

With the connection established between Markovianity and linear dynamic models, 
the discussion of Section 4.1.1 becomes germane here. In particular, we saw that an 
nth-order scalar autoregressive or nth-order scalar moving average process, both of which are nth-order Markov, could, via state augmentation, be rewritten in first-order n-vector form (6.9). Thus we understand that the notion of Markov order is somewhat fluid, and furthermore that the modelling of higher-order Markov processes is
not necessarily inherently difficult, but may instead just require an increase in state 
dimension. 
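
State augmentation can be sketched for a second-order autoregressive process as follows. The coefficients in `a` and the helper name `companion` are hypothetical; the sketch only checks that the scalar second-order recursion and the first-order two-vector recursion of the form (6.9) generate the same sequence when driven by the same noise.

```python
import numpy as np

def companion(a):
    """First-order state matrix for x(t) = sum_i a[i-1] x(t-i) + w(t)."""
    n = len(a)
    A = np.zeros((n, n))
    A[0, :] = a                   # newest sample from the n previous ones
    A[1:, :-1] = np.eye(n - 1)    # shift the older samples down by one
    return A

a = np.array([1.5, -0.7])         # hypothetical stable AR(2) coefficients
A = companion(a)
B = np.zeros(len(a))
B[0] = 1.0                        # noise drives the newest sample only

rng = np.random.default_rng(2)
T = 100
w = rng.standard_normal(T)

# Scalar second-order recursion
x = np.zeros(T)
for t in range(2, T):
    x[t] = a[0] * x[t - 1] + a[1] * x[t - 2] + w[t]

# Equivalent first-order vector recursion with state [x(t), x(t-1)]
s = np.zeros(2)
x_vec = np.zeros(T)
for t in range(2, T):
    s = A @ s + B * w[t]
    x_vec[t] = s[0]

print("recursions agree:", np.allclose(x, x_vec))
```

The price of the first-order form is the doubled state dimension, which is the trade-off described above.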



6.2 Multidimensional Markovianity 



The generalization of Markov decoupling to two and higher dimensions is known 
as a Markov Random Field [62, 207, 335]. The immediate difficulty is that the one- 
dimensional concepts of "past" and "future" are lost, inasmuch as there is no natural 
ordering of the elements in a multidimensional grid. Instead, a random field $z$ on a lattice (or grid) $\Omega$ is Markov (Figure 6.3) if the knowledge of the process on a boundary set $b$ decouples the inside $i$ and outside $o$ of the set:

$p(z_i | z_b, z_o) = p(z_i | z_b), \qquad p(z_o | z_b, z_i) = p(z_o | z_b).$ (6.12)



This boundary concept, although elegantly intuitive, is lacking in details: how "thick" 
does the boundary need to be? Five examples of boundaries, with varying degrees of 
ambiguity, are shown in Figure 6.4. 
[Figure 6.3 labels: Boundary, Outside.]



Fig. 6.3. The notion of Markovianity extended to higher dimensions: A multidimensional 
random process is Markov, with respect to some boundary, if the process within the boundary 
(shaded) decouples the process inside and outside. 



Fig. 6.4. The "thickness" of the boundary is clear enough in cases (a) and (b), but how do we 
define the thickness of a diagonal boundary (c,d) or in more heterogeneous cases (e)? 



Fig. 6.5. (a) The simplest neighbourhood structure for a two-dimensional Markov process: 
Any pixel (black dot) is decoupled from the rest of the domain when conditioned on its four 
immediate (shaded) neighbours, (b) The definition of two-dimensional neighbourhood order: 
for an nth-order model, the neighbourhood conditionally separating the shaded region from 
the rest of the domain contains the labeled pixels from one to n. The neighbourhood in (a) is 
therefore first order. 



It is often simpler, and more explicit, to talk about decorrelating or decoupling a single lattice site $j$ from the rest of the lattice $\Omega \setminus j$ by conditioning on a local neighbourhood $\mathcal{N}_j$ [86, 127]:

$p(z_j \,|\, \{z_k, k \in \Omega \setminus j\}) = p(z_j \,|\, \{z_k, k \in \mathcal{N}_j\}),$ (6.13)

as illustrated in Figure 6.5. For a neighbourhood to be valid it must satisfy two simple requirements:

1. A site is not its own neighbour: $j \notin \mathcal{N}_j$.

2. The neighbourhood property must reciprocate:

$j \in \mathcal{N}_k \iff k \in \mathcal{N}_j,$ (6.14)

from which it is clear that the random field is loopy, in the sense of Figure 5.1, which will lead to challenges in computation.
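
The two validity requirements are simple to test on a small lattice. The sketch below is an illustration only: the neighbourhood structure is stored as a dictionary from each site to its set of neighbours, and the function names are hypothetical. A first-order (four-neighbour) structure passes, whereas a one-sided "causal" structure fails reciprocity.

```python
def first_order_neighbours(rows, cols):
    """Four nearest neighbours of every site on a rows x cols lattice."""
    nbrs = {}
    for i in range(rows):
        for j in range(cols):
            cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            nbrs[(i, j)] = {(a, b) for a, b in cand
                            if 0 <= a < rows and 0 <= b < cols}
    return nbrs

def causal_neighbours(rows, cols):
    """One-sided structure: only the sites above and to the left."""
    nbrs = {}
    for i in range(rows):
        for j in range(cols):
            cand = [(i - 1, j), (i, j - 1)]
            nbrs[(i, j)] = {(a, b) for a, b in cand if a >= 0 and b >= 0}
    return nbrs

def is_valid(nbrs):
    """Check that no site is its own neighbour and that neighbours reciprocate."""
    no_self = all(j not in nj for j, nj in nbrs.items())
    reciprocal = all(j in nbrs[k] for j, nj in nbrs.items() for k in nj)
    return no_self and reciprocal

print("first-order valid:", is_valid(first_order_neighbours(4, 5)))
print("causal valid:     ", is_valid(causal_neighbours(4, 5)))
```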



In most circumstances the shape of neighbourhood $\mathcal{N}_j$ is not a function of location $j$. For the neighbourhood structure to be stationary, the reciprocating property demands that the neighbourhood be symmetric:

$j + \delta \in \mathcal{N}_j \;\Rightarrow\; j - \delta \in \mathcal{N}_j$ (6.15)

for any offset $\delta$, a property which holds true for common neighbourhood structures. The shape and extent of $\mathcal{N}_j$, measured by the order of the neighbourhood, is one of the fundamental properties characterizing a random field; a first-order neighbourhood is shown in Figure 6.5, along with the numbering scheme for higher-order neighbourhoods. It is crucial to understand that the size of image structure which can be realized by a Markov model is not related to model order or neighbourhood size: a first-order field could exhibit long correlations, and a fourth-order field possibly short ones. However, the complexity of the realized structure increases with order. Thus the first-order neighbourhood of Figure 6.5(a) can lead to models in which widely-separated pixels are highly correlated (what we mean by a "large" structure), but it is limited to only very simple patterns or textures.

Fig. 6.6. From Figure 6.5 it follows that the left and right panels show first- and second-order boundaries, respectively.

The above discussion is entirely formulated in terms of the probability density p(z), 
which is impractical for large random fields; the following sections summarize the 
two broad alternatives which have been developed: 

1. Gauss-Markov Random Fields: z is Gaussian, in which case the field can be 
characterized explicitly in terms of expectations rather than probability densities. 

2. Gibbs Random Fields: an energy H(z) is associated with each possible field z; 
a probability density is then constructed implicitly from H(z) in such a way that 
p(z) satisfies (6.12), (6.13). 



6.3 Gauss-Markov Random Fields 



So far, all notions of domain decomposition have been written in terms of probability distributions and conditional independence, which are very inconvenient numerically. Instead, a related notion is that of conditional decorrelation, which implies the absence of any linear relationship, as opposed to conditional independence, which implies the absence of any relationship. In the Gaussian case [61], meaning that $z$ is jointly Gaussian, decorrelatedness and independence are equivalent. Because correlations are much simpler than probability densities, we are motivated to consider this more limited notion of Markovianity.

Given a neighbourhood structure $\{\mathcal{N}_j\}$, we say that a random process $z$ is Strongly Markov [86] with respect to $\{\mathcal{N}_j\}$ if $z$ is conditionally independent as

$p(z_j \,|\, \{z_k, k \in \Omega \setminus j\}) = p(z_j \,|\, \{z_k, k \in \mathcal{N}_j\}).$ (6.16)

This is the definition of Markovianity, copied from (6.13), that we have used up to this point.

Next, we say that a random process $z$ is Weakly Markov [86] with respect to $\{\mathcal{N}_j\}$ if $z$ conditionally decorrelates as

$E[z_j \,|\, \{z_k, k \in \Omega \setminus j\}] = E[z_j \,|\, \{z_k, k \in \mathcal{N}_j\}],$ (6.17)

in contrast to (6.13). We recognize such a conditional expectation as the Bayesian estimate (3.46) of $z_j$, which is linear in the Gaussian case, but nonlinear in general.

Formulating a nonlinear Bayesian estimate is really not of interest here; rather we seek a simple alternative to Markov fields based purely on second-order statistics. Therefore the much simpler, and more widely used, definition is that of a Wide-Sense Markov random field [86] with respect to $\{\mathcal{N}_j\}$:

$\hat{z}_j = E_{\mathrm{LLSE}}[z_j \,|\, \{z_k, k \in \Omega \setminus j\}] = E_{\mathrm{LLSE}}[z_j \,|\, \{z_k, k \in \mathcal{N}_j\}],$ (6.18)

where $E_{\mathrm{LLSE}}$ explicitly refers to a linear expectation, the best linear estimate of $z_j$, as discussed in Section 3.2. In the Gaussian case Markov, weakly Markov, and wide-sense Markov are all equivalent.

It is also true that a zero-mean random process $z$ is wide-sense Markov with respect to $\{\mathcal{N}_j\}$ if

$z_j = \sum_{k \in \mathcal{N}_j} \bar{g}_{j,k} z_k + w_j \qquad E[z_k w_j] = 0 \;\; \forall j \ne k.$ (6.19)

The equivalence of these two forms is easily seen:

(6.18) $\Rightarrow$ (6.19): $E_{\mathrm{LLSE}}$ is a linear estimator for $z_j$. From the left-hand side of (6.18) this estimator is global and is therefore the optimum linear estimator. Therefore by orthogonality the estimation error $w_j$ will be uncorrelated with all coefficients $z_k$. The right-hand side of (6.18) asserts that the estimator can be written locally, within the neighbourhood, as in (6.19).

(6.19) $\Rightarrow$ (6.18): The expression (6.19) defines a linear estimator

$\hat{z}_j = \sum_{k \in \mathcal{N}_j} \bar{g}_{j,k} z_k,$ (6.20)

having an estimation error

$\tilde{z}_j = \hat{z}_j - z_j = -w_j$ (6.21)

meaning that $w_j$ obeys orthogonality. By definition this must therefore be the linear least-squares estimator (LLSE), implying (6.18).

This equivalence is particularly useful, because it relates a conceptual notion of de- 
coupling (6.18) with an associated linear estimator in autoregressive form (6.19). 

It is particularly illuminating to derive the statistics of $w$ and $z$ from (6.19). First, rewrite (6.19) as

$\sum_k g_{j,k} z_k = w_j \qquad g_{j,k} = \begin{cases} -\bar{g}_{j,k} & k \in \mathcal{N}_j \\ 1 & j = k \\ 0 & \text{otherwise} \end{cases}$ (6.22)

where $\bar{g}_{j,k}$ denotes the neighbourhood weights of (6.19). If we let the noise variance $\text{var}(w_j) = \sigma_j^2$, then for two sites $j \ne k$,

$E[w_j w_k] = E\left[ w_j \sum_i g_{k,i} z_i \right]$ (6.23)

$= g_{k,j} E[w_j z_j] = g_{k,j} E[w_j w_j] = g_{k,j} \sigma_j^2$ (6.24)

from which the reciprocity requirement (6.14) on the neighbourhood follows:

$E[w_j w_k] = E[w_k w_j] \;\Rightarrow\; g_{k,j} \sigma_j^2 = g_{j,k} \sigma_k^2.$ (6.25)

Thus we see that the noise process $w$ is not, in fact, white: there are off-diagonal correlations. Indeed, it has a correlation structure given by the linear estimation coefficients $g$, where the sparseness and locality of the correlation structure are exactly those of the random field neighbourhoods.

If we lexicographically reorder $z, w$ into vectors, as usual, then

(6.22) $\Rightarrow$ $Gz = w$
(6.24) $\Rightarrow$ $\text{cov}(w) = G\Sigma$,

where $G$ collects the estimator coefficients $\{g_{i,j}\}$, and $\Sigma = \text{Diag}(\sigma_1^2, \sigma_2^2, \ldots)$ is a diagonal matrix with the noise variances $\sigma_j^2$ along the diagonal. Although $G\Sigma$ gives the appearance of being asymmetric, the constraints relating $g$ and $\sigma^2$ in (6.25) do ensure the required symmetry of $\text{cov}(w)$.

Because $G\Sigma$ is the covariance of a random vector it must (for any valid model) be invertible, from which it follows that $G$ is invertible, thus

$z = G^{-1} w \;\Rightarrow\; \text{cov}(z) = G^{-1} G \Sigma G^{-T} = \Sigma G^{-T}.$ (6.26)
Fig. 6.7. The symmetry structure required by the LLSE kernel of a stationary Gauss-Markov
random field. 



The most profound result, however, follows from the inverse covariance of the random
field,

    $$ (\mathrm{cov}(z))^{-1} = G^T \Sigma^{-1}, $$    (6.27)

leading to the following significant conclusions:

1. Whereas the entries in a covariance describe the correlation between two random-field
elements, the entries in a covariance inverse correspond to the associated
linear estimator. That is, the covariance inverse describes a model.

2. The sparsity and locality of a matrix inverse are directly related to the neighbourhood
structure of the associated random field.

3. Assuming a covariance inverse to be sparse-banded is equivalent to asserting that
the corresponding random field is wide-sense Markov.

The study of Markov random fields has thus led to a much deeper understanding of
inverse covariances, in particular the meaning of sparsity.
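
These conclusions can be checked numerically on a very small problem. The following is a minimal sketch in plain Python, assuming a one-dimensional chain of six sites, a symmetric nearest-neighbour weight a = 0.4 and a common noise variance sigma2 = 2; these values, and the helper name `inverse`, are chosen here purely for illustration and do not come from the text. It builds G as in (6.22), forms the covariance inverse of (6.27), and inverts it to obtain the covariance.

```python
def inverse(A):
    """Gauss-Jordan inverse with partial pivoting; intended for small dense matrices."""
    n = len(A)
    # Augment A with the identity: [A | I]
    M = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


n, a, sigma2 = 6, 0.4, 2.0  # illustrative values, not taken from the text

# G z = w as in (6.22): 1 on the diagonal, -a for the two nearest neighbours
G = [[1.0 if i == j else (-a if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]

# Covariance inverse (6.27) with Sigma = sigma2 * I, that is G^T / sigma2
P_inv = [[G[j][i] / sigma2 for j in range(n)] for i in range(n)]

# Covariance of z, obtained by inverting the covariance inverse
P = inverse(P_inv)

print("covariance inverse (banded):")
for row in P_inv:
    print(" ".join(f"{v:6.2f}" for v in row))
print("covariance (full):")
for row in P:
    print(" ".join(f"{v:6.2f}" for v in row))
```

The covariance inverse has nonzero entries only where two sites are neighbours, whereas every entry of the covariance is nonzero: the model is local even though the correlations are not.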

Finally, we understand that a given Markov model g or G implicitly specifies the
random-field prior $P^{-1} = G^T \Sigma^{-1}$, thus random-field estimation proceeds as usual:

    $$ \hat{z} = (C^T R^{-1} C + P^{-1})^{-1} C^T R^{-1} m $$    (6.28)
    $$ \;\;\, = (C^T R^{-1} C + G^T \Sigma^{-1})^{-1} C^T R^{-1} m. $$    (6.29)



To take advantage of the sparsity normally encountered in C, R and G, this estimator
would be rewritten to remove the matrix inversion as

    $$ (C^T R^{-1} C + G^T \Sigma^{-1}) \hat{z} = C^T R^{-1} m. $$    (6.30)



A very important special case of Gauss-Markov random fields is stationarity, in
which case the local estimator g and the driving noise statistics $\sigma$ are both location
invariant:

    $$ g_{j,k} = g_{k-j}, \qquad \sigma_j^2 = \sigma^2 \;\Rightarrow\; \Sigma = \sigma^2 I, $$    (6.31)

Fig. 6.8. Regions of support for causal (left) and noncausal (right) neighborhoods of the central
element. The half plane illustrated on the left is only one of many possible choices.



so that the random field statistics are

    $$ \mathrm{cov}(w) = G \sigma^2, \qquad \mathrm{cov}(z) = G^{-1} \sigma^2, \qquad E[z w^T] = I \sigma^2. $$    (6.32)

Combining location invariance with the reciprocity symmetry of (6.25) leads to the
constrained estimation weights

    $$ g_{j-k} = g_{k-j}, $$    (6.33)



as sketched in Figure 6.7, a constraint which essentially reflects the fact that covariance
matrices are symmetric. The random field stationarity then allows the prior
model to be expressed in an efficient convolutional kernel form, in which case the
prior term of (6.30) becomes

    $$ G^T \Sigma^{-1} \hat{z} = \frac{1}{\sigma^2} G \hat{z} = \frac{1}{\sigma^2} \, g * \hat{z}. $$    (6.34)
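
The equivalence in (6.34) between the matrix product and the kernel form can be illustrated directly. The sketch below is a minimal example under stated assumptions: a one-dimensional periodic lattice, a symmetric three-element kernel satisfying (6.33), and an arbitrary test vector; the function names `kernel_matrix` and `apply_kernel` are hypothetical. It compares the dense product with G against the kernel applied site by site, which never forms G at all.

```python
def kernel_matrix(kernel, n):
    """Dense n x n matrix with G[j][k] = g_{k-j} on a periodic lattice.

    `kernel` maps an offset (k - j) to its weight g_{k-j}.
    """
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for offset, weight in kernel.items():
            G[j][(j + offset) % n] = weight
    return G


def apply_kernel(kernel, z):
    """Apply the kernel directly: (g * z)_j = sum over offsets of g_offset * z_{j+offset}.

    Because the kernel is symmetric, as required by (6.33), this sum equals the
    convolution of g with z.
    """
    n = len(z)
    return [sum(weight * z[(j + offset) % n] for offset, weight in kernel.items())
            for j in range(n)]


sigma2 = 0.5                               # illustrative noise variance
kernel = {0: 1.0, -1: -0.3, 1: -0.3}       # illustrative symmetric kernel
z = [1.0, 4.0, 2.0, 0.0, -1.0, 3.0]        # arbitrary test vector

G = kernel_matrix(kernel, len(z))
dense = [sum(G[j][k] * z[k] for k in range(len(z))) / sigma2 for j in range(len(z))]
conv = [v / sigma2 for v in apply_kernel(kernel, z)]

print(dense)
print(conv)
```

The kernel form touches only the neighbours of each site, so its cost grows with the neighbourhood size rather than with the square of the number of sites.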



The two key remaining questions, then, are how to infer the model parameters g,
addressed in Section 6.6, and how to efficiently estimate and sample from a given
model, addressed in Chapters 9 and 11.



6.4 Causal Gauss-Markov Random Fields 



If a random field z is causal, each neighborhood $\mathcal{N}_j$ must limit its support to one half
of the plane, as sketched in Figure 6.8. These are known as nonsymmetric half-plane 
(NSHP) models [7], and lead to very simple autoregressive equations for sampling 
and estimation. 

Specifically, there must exist an ordering of the field elements, a sequential model 
in the sense of Figure 5.1, such that each element depends only on the values of 
elements lying earlier in the ordering; let

Example 6.1: Example Markov Kernels

We have seen that there is a relationship between an inverse-covariance $P^{-1}$ and
MRF coefficients, and also between $P^{-1}$ and constraints Q. We can therefore
consider the Membrane or Thin-Plate kernels of Figure 5.14 as MRF models.



Just as the correlation length or feature size of the membrane and thin-plate ker- 
nels could be varied (Figure 5.15), similarly the feature size in the MRF samples 
is not a function of neighbourhood size, but rather of the parameters in the model. 
Instead, as the neighbourhood size increases (from top to bottom) it is the com- 
plexity of the structure in the random field which increases, not its scale. 

All three sampling examples were generated using the FFT method discussed in
Section 8.3.

[Figure 6.9 contents. Noncausal (top): Random Field Assertions $Gz = w$, $E[z w^T] = I$; Resulting Deduced Statistics $\Lambda_z = G^{-1}$, $\Lambda_z^{-1} = G$, $\Lambda_w = G$. Causal (bottom): Random Field Assertions $Gz = w$, $\Lambda_w = I$; Resulting Deduced Statistics $\Lambda_z = G^{-1} G^{-T}$, $\Lambda_z^{-1} = G^T G$, $E[z w^T] = G^{-1}$.]

Fig. 6.9. A broad overview comparing the assumptions and statistics of noncausal (top) and
causal (bottom) random fields, assuming the noise variance $\sigma^2 = 1$. The random field statistics,
right, can be deduced or computed from the model assertions, left. In the noncausal case,
we assert the local model G and the decorrelatedness of field and process noise, whereas in
the causal case we assert the local model G and the whiteness of the process noise. In both
cases the inverse-covariance of the random field is sparse.



    $$ \mathcal{N} < j $$    (6.35)

imply that all of the elements of $\mathcal{N}$ appear earlier in the ordering than j, that is, in
the half-plane preceding j. Then, in parallel with the definition (6.18), (6.19) for the
noncausal case, a random process z is causally wide-sense Markov with respect to a
causal neighbourhood system

    $$ \mathcal{N}_j < j \quad \forall j $$    (6.36)

if

    $$ \hat{z}_j = E_{\mathrm{LLSE}}[z_j \mid \{z_k, k < j\}] = E_{\mathrm{LLSE}}[z_j \mid \{z_k, k \in \mathcal{N}_j\}], $$    (6.37)

or, equivalently, if

    $$ z_j = \sum_{k \in \mathcal{N}_j} g_{j,k} z_k + w_j, \qquad E[z_k w_j] = 0 \;\; \forall k < j, $$    (6.38)

where the noise $w_j$ in the autoregression (6.38) can now be chosen to be white. It
is important to note the three differences between the formulation of the causal and
noncausal cases, illustrated visually in Figure 6.9:



1. The neighbourhood structure is causal, $\mathcal{N}_j < j$, therefore reciprocity no longer
applies.

2. The Markov decoupling in (6.37) is with respect to elements in the half plane,
not with the entire domain.

3. The estimation error w in (6.38) is orthogonal to elements in the half plane, not 
to the entire process. 

The appeal of causal models is the sequential (autoregressive) coupling of (6.38), 
allowing very fast estimation using the Kalman filter. However, in practice these 
models have limited applicability, since most random fields are not well represented 
causally. 
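
The sequential coupling of (6.38) makes sampling from a causal model a single raster-scan pass. The following is a minimal sketch, not a model taken from the text: it assumes a raster ordering, a three-element causal neighbourhood (above, left, and above-left), separable weights (rho, rho, -rho^2), and zero values outside the lattice. The function name `sample_causal_field` and all parameter values are hypothetical.

```python
import random


def sample_causal_field(rows, cols, rho=0.9, noise_std=1.0, seed=0):
    """Draw one sample of a causal field by visiting sites in raster order.

    Each site depends only on sites earlier in the ordering, as in (6.38),
    and is driven by white Gaussian noise.
    """
    rng = random.Random(seed)
    z = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            up = z[i - 1][j] if i > 0 else 0.0
            left = z[i][j - 1] if j > 0 else 0.0
            diag = z[i - 1][j - 1] if (i > 0 and j > 0) else 0.0
            w = rng.gauss(0.0, noise_std)            # white driving noise
            z[i][j] = rho * up + rho * left - rho * rho * diag + w
    return z


field = sample_causal_field(8, 10, seed=1)
for row in field[:3]:
    print(" ".join(f"{v:7.2f}" for v in row))
```

No linear system is solved and no iteration is needed, which is the computational appeal of causal models; the price is the directional structure imposed by the ordering.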



6.5 Gibbs Random Fields 

Gibbs random fields (GRFs) are random fields characterized by neighbouring-site
interactions. These were originally used in statistical physics [55, 332] to study the 
thermodynamic properties of interacting particle systems, such as lattice gases, and 
their use in image processing was popularized by the papers of Besag [25,26] and Ge- 
man and Geman [127]. The neighbouring interactions in GRFs lead to effective and 
intuitive image models, and GRFs are frequently used as prior models in Bayesian 
formulations. 

Mathematically, a Gibbs distribution for a random field z is

    $$ p(z) = \frac{1}{Z} e^{-\beta H(z)}, $$    (6.39)

where $\beta > 0$ is a constant, a "temperature" parameter,^1 and

    $$ Z = \sum_{z} e^{-\beta H(z)} $$    (6.40)

is a normalization constant, known as the partition function. The enormity of the sum
prevents Z from being evaluated for all but the tiniest problems. Algorithms which
work with GRFs avoid the evaluation of Z by focusing entirely on likelihood ratios
or energy differences:

    $$ \frac{p(z_1)}{p(z_2)} = \exp\big( -\beta ( H(z_1) - H(z_2) ) \big). $$    (6.41)

The well-known Metropolis and Gibbs Sampler algorithms both use this approach,
and are discussed in Section 11.3.
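
Both points, the size of the sum in (6.40) and the cancellation of Z in (6.41), can be seen on a problem small enough to enumerate. The sketch below assumes a toy pairwise energy on a 3 x 3 binary field and beta = 0.5, both chosen only for illustration; the names `energy`, `p`, `z1` and `z2` are hypothetical.

```python
import itertools
import math


def energy(z, rows, cols):
    """Toy energy: sum of products of horizontally and vertically adjacent states."""
    h = 0
    for i in range(rows):
        for j in range(cols):
            if j > 0:
                h += z[i * cols + j] * z[i * cols + j - 1]
            if i > 0:
                h += z[i * cols + j] * z[(i - 1) * cols + j]
    return h


rows, cols, beta = 3, 3, 0.5                      # illustrative values

# Every possible field: 2**(rows*cols) terms in the partition function (6.40)
states = list(itertools.product((-1, 1), repeat=rows * cols))
Z = sum(math.exp(-beta * energy(z, rows, cols)) for z in states)


def p(z):
    """Gibbs probability (6.39), which requires the full partition function."""
    return math.exp(-beta * energy(z, rows, cols)) / Z


z1 = (1,) * 9                                     # constant field
z2 = (1, -1, 1, -1, 1, -1, 1, -1, 1)              # checkerboard field

ratio_with_Z = p(z1) / p(z2)
ratio_without_Z = math.exp(-beta * (energy(z1, rows, cols) - energy(z2, rows, cols)))

print(len(states), ratio_with_Z, ratio_without_Z)
```

For this 3 x 3 field the sum has 512 terms; a binary 16 x 16 field would already require 2^256 terms, whereas the ratio in (6.41) needs only two energy evaluations regardless of the field size.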



1 See Simulated Annealing in Section 11.3.1 to understand the role of the temperature parameter
$\beta = 1/T$ in Gibbs distributions.

Although it would appear that we are focusing once again on probability densities,
p(z) is, in fact, never evaluated. All inferences of z take place implicitly, strictly
through evaluations of the energy function H(z). The energy function is normally
written as a sum of local interaction terms, called clique potentials:

    $$ H(z) = \sum_{c \in \mathcal{C}} V(\{z_i, i \in c\}), $$    (6.42)

where $c \subset \Omega$ is a clique, either a single site or a set of interacting sites, $V(\cdot)$ is a
clique potential, and $\mathcal{C}$ is the set of all possible cliques. Any random field z satisfying
(6.39), (6.42) is known as a Gibbs random field.

GRFs have been broadly used in statistical image processing, due to their attractive 
properties: 

1. There is a mathematical equivalence between Gibbs and Markov random fields, 
unifying these two concepts. 

2. In the linear-Gaussian case, Gibbs random fields are easily understood in the 
context of constraint-matrix L, as discussed throughout Chapter 5. 

3. A simple, intuitive energy function H can describe enormously complicated 
probability functions p. 

4. Most significantly, there exist algorithms for Gibbs random fields which extend 
easily to solving nonlinear estimation and sampling problems, problems which 
become exceptionally complex under classical Bayesian estimation. 

In particular, discrete-state problems, which are inherently nonlinear, are widely
solved using Gibbs methods.

We address the above properties in turn. In terms of the relationship between Gibbs
and Markov fields, we first need to establish a connection between cliques and neighbourhoods.
Suppose that $\mathcal{N}_j$ is a noncausal neighbourhood system on lattice $\Omega$:

    $$ \mathcal{N}_j \subset \Omega, \qquad j \notin \mathcal{N}_j. $$    (6.43)

A neighbourhood system $\mathcal{N}_j$ induces a clique set $\mathcal{C}$, such that any pair of points in
$c \in \mathcal{C}$ are neighbours:

    $$ j, k \in c \;\Rightarrow\; j \in \mathcal{N}_k, \;\; k \in \mathcal{N}_j. $$    (6.44)

Examples of the clique sets associated with first- and second-order neighbourhoods
are shown in Figure 6.10.

Then, by the Hammersley-Clifford theorem [25], the GRF and MRF are equivalent:

z is a Markov random field with respect to $\mathcal{N}$ if and only if z is a Gibbs
random field with respect to $\mathcal{C}$, where $\mathcal{C}$ is the set of cliques with respect to
neighbourhood system $\mathcal{N}$.

[Figure 6.10 panels: First-order Neighbourhood and Associated Cliques; Second-order Neighbourhood and Associated Cliques.]



Fig. 6.10. The relationship between Markov neighbourhoods (top) and Gibbs cliques (below). 
A clique is a set of one or more pixels such that for each pair of pixels, the two pixels are 
neighbours. Whereas a neighbourhood (shaded) separates a center pixel (dot) from the rest of 
the domain, cliques have no notion of centre or origin. 



In retrospect this should have been obvious. Certainly the connection becomes very
clear in the linear-Gaussian case. Suppose we have a random field z whose prior is
implicitly specified in terms of a set of constraints L:

    $$ L z = w, \qquad w \sim N(0, I), $$    (6.45)

where w is white and Gaussian, as in Chapter 5. Then the prior on the random field
is

    $$ p(z) = \frac{1}{Z} e^{-\frac{1}{2} w^T w} = \frac{1}{Z} e^{-\frac{1}{2} (Lz)^T (Lz)} $$    (6.46)

    $$ \;\;\, = \frac{1}{Z} e^{-\frac{1}{2} \sum_i (l_i^T z)^2} \qquad \text{(Gibbs)} $$    (6.47)

    $$ \;\;\, = \frac{1}{Z} e^{-\frac{1}{2} z^T P^{-1} z} \qquad \text{(Markov)} $$    (6.48)

where $l_i^T$ is the ith row of L and $P^{-1} = L^T L$. That is, a Gauss-Markov random field
model directly specifies the inverse-covariance $P^{-1}$, whereas the Gibbs model implicitly
describes the covariance square root through a set of constraints $\{l_i\}$ on cliques, where
each clique is just the set of pixel locations involved in a given constraint.
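
The two views can be compared on a small example. The following sketch assumes a one-dimensional field of five sites with first-difference constraints, so that each clique is a pair of adjacent sites; the constraint set, the test vector, and the function names are illustrative assumptions. This particular L has fewer rows than columns, so $L^T L$ is singular, which does not matter for comparing the two exponents.

```python
def first_difference_constraints(n):
    """Rows l_i^T of L, each penalizing z[i+1] - z[i]; the clique of row i is {i, i+1}."""
    L = [[0.0] * n for _ in range(n - 1)]
    for i in range(n - 1):
        L[i][i], L[i][i + 1] = -1.0, 1.0
    return L


def gibbs_exponent(L, z):
    """Clique (Gibbs) form: 0.5 * sum_i (l_i^T z)^2."""
    return 0.5 * sum(sum(l * v for l, v in zip(row, z)) ** 2 for row in L)


def squared_constraints(L):
    """Markov form of the model: the matrix L^T L."""
    n = len(L[0])
    return [[sum(row[a] * row[b] for row in L) for b in range(n)] for a in range(n)]


def markov_exponent(P_inv, z):
    """Neighbourhood (Markov) form: 0.5 * z^T (L^T L) z."""
    n = len(z)
    return 0.5 * sum(z[a] * P_inv[a][b] * z[b] for a in range(n) for b in range(n))


z = [0.5, -1.0, 2.0, 2.5, 0.0]          # arbitrary test field
L = first_difference_constraints(len(z))
P_inv = squared_constraints(L)

print(gibbs_exponent(L, z), markov_exponent(P_inv, z))
for row in P_inv:
    print(row)
```

Each row of L involves two adjacent sites, while each interior row of $L^T L$ involves a site and both of its neighbours: the neighbourhood of a site is the union of the cliques which contain it.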

Thus the Markov neighbourhood $\mathcal{N}$ is just the union of the convolution of the clique
sets $\mathcal{C}$, in precisely the same way that the kernel Q was found as the convolution of
constraints C in Figure 5.8 on page 152. Indeed, the forms of the first-order membrane
constraints and kernel in Figure 5.8 are clearly evident in the corresponding
cliques and neighbourhood in Figure 6.10. The connections between these concepts
are illustrated in Figure 6.11.

Fig. 6.11. The Hammersley-Clifford theorem establishes a connection between the clique-Gibbs-constraint
and the neighbourhood-Markov-squared-constraint contexts.

Fig. 6.12. Four possible criteria, relating the degree of penalty to the size of error. The standard
quadratic criterion, left, grows rapidly, appropriate for Gaussian statistics (where large deviations
from the mean are rare), but inappropriate for problems involving outliers. An illustration
of robust estimation is shown in Example 6.2.

So what sorts of local energy functions H are available to implicitly determine
the random-field distribution p(z)? Certainly all of the basic smoothness models
of Chapter 5, such as membrane, thin-plate, and other combinations of nth-order
penalties, are Gibbs. For example,

    $$ H(z) = \sum_{i,j} (z_{i,j} - z_{i-1,j})^2 + (z_{i,j} - z_{i,j-1})^2 $$    (6.49)

is a membrane energy function, implicitly describing a membrane prior for p(z). Just
as simply, introducing measurements to (6.49) leads to

    $$ H(z|m) = \sum_{i,j} (z_{i,j} - m_{i,j})^2 + (z_{i,j} - z_{i-1,j})^2 + (z_{i,j} - z_{i,j-1})^2, $$    (6.50)

which now implicitly describes the posterior p(z|m).
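
The energies (6.49) and (6.50) translate almost directly into code. The following is a minimal sketch, assuming that difference terms which would reach outside the lattice are simply omitted (a boundary choice made here for illustration, not one stated in the text); the function names are hypothetical.

```python
def membrane_energy(z):
    """Membrane energy (6.49): squared first differences, vertical and horizontal."""
    rows, cols = len(z), len(z[0])
    h = 0.0
    for i in range(rows):
        for j in range(cols):
            if i > 0:                                  # (z_{i,j} - z_{i-1,j})^2
                h += (z[i][j] - z[i - 1][j]) ** 2
            if j > 0:                                  # (z_{i,j} - z_{i,j-1})^2
                h += (z[i][j] - z[i][j - 1]) ** 2
    return h


def posterior_energy(z, m):
    """Posterior energy (6.50): squared measurement residuals plus the membrane prior."""
    data = sum((zij - mij) ** 2
               for z_row, m_row in zip(z, m)
               for zij, mij in zip(z_row, m_row))
    return data + membrane_energy(z)


z = [[0.0, 1.0], [2.0, 3.0]]
m = [[0.0, 0.0], [0.0, 0.0]]
print(membrane_energy(z))        # 1 + 4 + 4 + 1 = 10.0
print(posterior_energy(z, m))    # (0 + 1 + 4 + 9) + 10 = 24.0
```

A constant field has zero membrane energy, so under the prior alone all constant fields are equally likely; it is the measurement term which selects among them.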



However, the real strength of the Gibbs approach lies not in the representational
power of H, which is not so different from what we have seen before, but rather in
the existence of methods of sampling and estimation which apply to non-quadratic H. In
particular, there are two broad energy classes of great interest:



Non-Quadratic Random Fields: 

Quadratic, or least-squares, criteria are used throughout most of this text because
of the resulting simplicity: a quadratic criterion has a unique minimum which can
be found linearly.

However, in the Gibbs context it is easy to introduce a non-quadratic criterion
J(a, b). For example, to include a non-quadratic penalty term on measurement
residuals, we could rewrite (6.49) as

    $$ H(z) = \sum_{i,j} J(z_{i,j}, m_{i,j}) + \sum_{i,j} (z_{i,j} - z_{i-1,j})^2 + (z_{i,j} - z_{i,j-1})^2 $$    (6.51)

where three typical non-quadratic penalty terms are shown in Figure 6.12, and the
use and effect of such a term is illustrated in Example 6.2.

Of course, strongly Markov random fields can also possess nonlinear features, in 
principle, however the strength of Gibbs random fields is that relatively efficient 
methods of sampling and estimation exist for such nonlinear models, as shown in 
Chapter 11. 

Discrete-State Random Fields: 

The discrete-state case reflects the origins of Gibbs fields in quantum physics 
where each field element reflects a discrete quantum state, such as the spin of 
an atom. The most famous of all such models is the binary Ising model [335] 

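    $$ H(z) = \sum_{i,j} z_{i,j} z_{i,j-1} + \sum_{i,j} z_{i,j} z_{i-1,j}, \qquad z_{i,j} \in \{-1, +1\}. $$    (6.52)

The simple generalization of the binary Ising form to a q-state element is known
as the Potts model:

    $$ H(z) = \sum_{i,j} \delta_{z_{i,j}, z_{i,j-1}} + \sum_{i,j} \delta_{z_{i,j}, z_{i-1,j}}, \qquad z_{i,j} \in \{0, 1, \ldots, q-1\}. $$    (6.53)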



In the context of image modelling, discrete states are most often used in hidden 
models, where the behaviour of each pixel in an image is governed by one of a 
finite set of possible states, such as in a mixture model. Two common examples are 
image segmentation (Appendix C), the labelling of each image pixel as belonging 
to one of a finite set of segments, or image classification, associating each pixel 
with one of a set of possible classes. Hidden state models are discussed further in 
Chapter 7. 
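
To make the discrete-state case concrete, the following sketch applies a generic single-site Metropolis update to the Ising energy exactly as written in (6.52). It is an illustration under stated assumptions rather than the algorithm as presented in Section 11.3: the lattice has free boundaries, sites are visited in raster order, and beta, the lattice size, and the function names are all chosen here for the example. The key point is that flipping one site changes the energy only through its four neighbours, so the partition function is never needed.

```python
import math
import random


def ising_energy(z):
    """Energy (6.52): products of horizontally and vertically adjacent states."""
    rows, cols = len(z), len(z[0])
    h = 0
    for i in range(rows):
        for j in range(cols):
            if j > 0:
                h += z[i][j] * z[i][j - 1]
            if i > 0:
                h += z[i][j] * z[i - 1][j]
    return h


def flip_energy_change(z, i, j):
    """H(after flipping site (i, j)) - H(before), using only the four neighbours."""
    rows, cols = len(z), len(z[0])
    s = 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < rows and 0 <= b < cols:
            s += z[a][b]
    return -2 * z[i][j] * s


def metropolis_sweep(z, beta, rng):
    """One raster-order pass; accept each flip with probability min(1, exp(-beta * dH))."""
    for i in range(len(z)):
        for j in range(len(z[0])):
            dH = flip_energy_change(z, i, j)
            if dH <= 0 or rng.random() < math.exp(-beta * dH):
                z[i][j] = -z[i][j]
    return z


rng = random.Random(0)
field = [[rng.choice((-1, 1)) for _ in range(8)] for _ in range(8)]
energy_before = ising_energy(field)
for _ in range(50):
    metropolis_sweep(field, beta=1.0, rng=rng)
energy_after = ising_energy(field)
print(energy_before, energy_after)
```

The same structure carries over to the Potts model of (6.53) by proposing a new label in place of a sign flip and recomputing the local energy change accordingly.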

The choice between Gibbs or Markov random fields normally proceeds fairly obvi- 
ously from the given problem. If we seek to learn a linear, second-order model from 
given data, then the Markov formulation is straightforward. If we wish to "guess" a 
model or develop it heuristically, or if our model contains nonlinearities or discrete 
random variables, then the Gibbs formulation is to be preferred.

Example 6.2: Robust Estimation



We present here a very simple example, illustrating the use of non-quadratic cri- 
teria in estimation. Suppose we have a simple non-Bayesian parameter estimation 
problem, a linear regression problem given ten data points (see Example 3.1). 

If the measurement error is Gaussian with constant variance (left), then the least- 
squares criterion is appropriate and gives an accurate result. If the measurements 
have the occasional bad data point (right), inconsistent with a Gaussian distribu- 
tion, then the regression result (solid) is poor in comparison with truth (dashed). 





[Figure panels: Regression with Gaussian Noise; Regression with Non-Gaussian Noise.]

Instead, to limit the effect of the bad data point, we can compare a measurement
m and its predicted location $\hat{m} = Cz$ using a non-quadratic criterion, such as the
capped penalty from Figure 6.12

    $$ J(m, \hat{m}) = \min\{\zeta^2, (m - \hat{m})^2\} $$

for some threshold $\zeta$. The resulting estimates, below, are generated with a threshold
$\zeta = 4$,




and are clearly superior to the previous results. This estimator is nonlinear, and 
although easily implemented for linear regression, would generally be difficult 
for large spatial problems. That the Gibbs sampler can handle such non-quadratic 
criteria is a strong asset. 
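
The effect of the capped penalty in Example 6.2 can be reproduced in a few lines. The sketch below is not the procedure used to generate the example's figures: the data are synthetic (a line with slope 2 and intercept 1, small fixed perturbations, and one gross outlier), and the capped criterion is minimized by a simple exhaustive heuristic which tries the line through every pair of points and then refits by least squares to the points left uncapped. All names and values are assumptions made for illustration.

```python
from itertools import combinations


def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def capped_cost(slope, intercept, xs, ys, zeta):
    """Sum of capped penalties J(m, m_hat) = min(zeta^2, (m - m_hat)^2)."""
    return sum(min(zeta ** 2, (y - (slope * x + intercept)) ** 2)
               for x, y in zip(xs, ys))


def robust_line(xs, ys, zeta):
    """Heuristic minimizer of the capped cost (an illustrative choice of method)."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue
        s = (y2 - y1) / (x2 - x1)
        b = y1 - s * x1
        c = capped_cost(s, b, xs, ys, zeta)
        if best is None or c < best[0]:
            best = (c, s, b)
    _, s, b = best
    # Refit using only the points whose penalty is not capped
    inliers = [(x, y) for x, y in zip(xs, ys) if (y - (s * x + b)) ** 2 < zeta ** 2]
    return least_squares_line([x for x, _ in inliers], [y for _, y in inliers])


xs = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, -0.05, 0.2, -0.15, 0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]
ys[8] += 30.0                                   # one bad data point

ls_slope, ls_intercept = least_squares_line(xs, ys)
robust_slope, robust_intercept = robust_line(xs, ys, zeta=4.0)
print(ls_slope, ls_intercept)
print(robust_slope, robust_intercept)
```

The exhaustive search is practical only because a line has two parameters; for a large random field no such enumeration is possible, which is where sampling-based Gibbs methods become attractive.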
6.6 Model Determination

In many problems we assert a constraints matrix L, such as the membrane and thin- 
plate priors from Chapter 5. In most cases the constraints asserted in L are local, 
involving the interactions between nearby state elements, in which case the resulting 
L T L is a sparse banded matrix and we have implicitly specified a Markov prior for 
the random field. Indeed, in any problem in which we have local constraints, or in 
which the covariance inverse is modelled as sparse-banded, we should immediately 
understand the implied prior to be Markov. 

However, we are not necessarily given a banded inverse matrix or a set of constraints; 
rather in many cases we are presented with a sample random field, from which we 
would like to infer a Markov model. We can, of course, compute a sample covariance 
from the given random field, however we run into two difficulties: 

1. The matrix inverse of the sample covariance will normally not be a nice, banded
matrix. In most cases, the given random field is not exactly Markov, but we wish 
to approximate it as such. Moreover, even if the given random field is precisely 
Markov, the sample statistics may not be. 

2. Any sample covariance is guaranteed to be positive-semidefinite. If the sample 
covariance is furthermore positive-definite then we can compute its matrix in- 
verse, which will also be positive-definite, however truncating the matrix inverse 
to make it banded may not leave it positive-definite (see Problem 5.3). 

So our task is one of model approximation. Given sample statistics P, find a matrix 
G such that 

1. Markovianity: G is sparse-banded 

2. Validity: G is positive-definite 

3. Fit to Model: $G^{-1} \approx P$.

A Markov random field is essentially a noncausal, multidimensional version of an au- 
toregressive process, thus we begin by discussing model learning for autoregressive 
processes, before tackling the related, but more complicated, problem of learning for 
Markov processes. 



6.6.1 Autoregressive Model Learning 

Autoregressive models are a classic, long-established form of time-series modelling
[6, 37], in which we wish to approximate a given stationary time series z(t) in terms
of an nth-order autoregressive process

    $$ z(t) = \sum_{i=1}^{n} \alpha_i z(t-i) + w(t), $$    (6.54)

where the zero mean, white noise process w is uncorrelated with the past of z:

    $$ E[w(t)] = 0, \qquad E[w(t) w(s)] = \sigma^2 \delta_{s,t}, \qquad E[w(t) z(s)] = 0 \;\text{ if } s < t. $$    (6.55)

The process z in (6.54) is assumed to be zero-mean; generalization to the nonzero- 
mean case is straightforward. 

Because the coefficients $\alpha_i$ affect the relationship between z(t) and z(t - i), it seems
plausible to suppose that the coefficients can be estimated by examining the correlation
of z with shifted versions of itself. Since z is stationary, its statistics can be
captured by a correlation kernel

    $$ \mathcal{P}_j = E[z(t) z(t-j)] = E[z(t-j) z(t)] = \mathcal{P}_{-j} $$    (6.56)

so by substituting (6.54) into (6.56) we get

    $$ \mathcal{P}_j = E[z(t) z(t-j)] = E\Big[ \Big( \sum_{i=1}^{n} \alpha_i z(t-i) + w(t) \Big) z(t-j) \Big] $$    (6.57)

    $$ \;\;\;\, = \sum_{i=1}^{n} \alpha_i E[z(t-i) z(t-j)] + E[w(t) z(t-j)] $$    (6.58)

    $$ \;\;\;\, = \sum_{i=1}^{n} \alpha_i \mathcal{P}_{i-j} + E[w(t) z(t-j)], $$    (6.59)

where the correlation E[w(t)z(t - j)] between z and w follows from (6.55):

    $$ j > 0 \;\Rightarrow\; E[w(t) z(t-j)] = 0 $$
    $$ j = 0 \;\Rightarrow\; E[w(t) z(t-j)] = E\Big[ w(t) \Big( \sum_{i=1}^{n} \alpha_i z(t-i) + w(t) \Big) \Big] = 0 + E[w(t) w(t)] = \sigma^2. $$    (6.60)

Therefore we arrive at a set of linear equations, known as the Yule-Walker equations
[37, 252], which interrelate the lag-correlations and the autoregressive coefficients:

    $$ \mathcal{P}_j = \sum_{i=1}^{n} \alpha_i \mathcal{P}_{i-j} + \delta_j \sigma^2. $$    (6.61)



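For this system of equations to be invertible we need n + 1 lag correlations, to infer
the n AR coefficients plus the noise variance $\sigma^2$.

We can vectorize (6.61) by stacking the correlations and the AR coefficients for
i, j = 1, ..., n:

    $$ \begin{bmatrix} \mathcal{P}_1 \\ \vdots \\ \mathcal{P}_n \end{bmatrix} = \underline{p} = \mathbf{P} \underline{\alpha} = \begin{bmatrix} \mathcal{P}_0 & \mathcal{P}_{-1} & \cdots & \mathcal{P}_{-(n-1)} \\ \mathcal{P}_1 & \mathcal{P}_0 & \cdots & \mathcal{P}_{-(n-2)} \\ \vdots & & \ddots & \vdots \\ \mathcal{P}_{n-1} & \mathcal{P}_{n-2} & \cdots & \mathcal{P}_0 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}. $$    (6.62)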

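Given a sample process z(t) we can find estimates of the lag correlations

    $$ \hat{\mathcal{P}}_j = \frac{1}{N} \sum_{t=1}^{N} z(t) z(t-j) $$    (6.63)

from which we can solve for the AR coefficients $\underline{\alpha}$ in (6.62) as

    $$ \hat{\underline{p}} = \hat{\mathbf{P}} \hat{\underline{\alpha}} \;\Rightarrow\; \hat{\underline{\alpha}} = \hat{\mathbf{P}}^{-1} \hat{\underline{p}} $$    (6.64)

and where the noise variance is estimated from autocorrelation lag j = 0:

    $$ \hat{\sigma}^2 = \hat{\mathcal{P}}_0 - \hat{\underline{\alpha}}^T \hat{\underline{p}}. $$    (6.65)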
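
The estimation steps (6.63) to (6.65) can be sketched as follows. The example assumes a second-order process with coefficients (0.5, -0.3) and unit noise variance, simulated here only to have data with a known answer; the function names, the burn-in length, and the handling of the first few samples in the lag sums (terms reaching before the start of the record are dropped) are illustrative choices.

```python
import random


def simulate_ar(alpha, sigma, n_samples, seed=0, burn_in=500):
    """Simulate (6.54); alpha[0] multiplies z(t-1), alpha[1] multiplies z(t-2), ..."""
    rng = random.Random(seed)
    z = [0.0] * len(alpha)
    for _ in range(burn_in + n_samples):
        past = sum(a * z[-1 - i] for i, a in enumerate(alpha))
        z.append(past + rng.gauss(0.0, sigma))
    return z[-n_samples:]


def lag_correlation(z, j):
    """Sample lag correlation as in (6.63), dropping terms outside the record."""
    n = len(z)
    return sum(z[t] * z[t - j] for t in range(j, n)) / n


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def yule_walker(z, order):
    """Estimate AR coefficients via (6.64) and the noise variance via (6.65)."""
    P = [lag_correlation(z, j) for j in range(order + 1)]
    p_vec = P[1:]
    P_mat = [[P[abs(j - i)] for i in range(order)] for j in range(order)]
    alpha_hat = solve(P_mat, p_vec)
    sigma2_hat = P[0] - sum(a * p for a, p in zip(alpha_hat, p_vec))
    return alpha_hat, sigma2_hat


z = simulate_ar(alpha=[0.5, -0.3], sigma=1.0, n_samples=20000)
alpha_hat, sigma2_hat = yule_walker(z, order=2)
print(alpha_hat, sigma2_hat)
```

With a record this long the estimates should lie close to the simulated values; with short records the sample correlations, and therefore the estimates, become considerably more variable.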



The choice of n, the model order, cannot be estimated and must be asserted. In the 
same way that a higher-order polynomial can always improve the goodness of fit to 
a given number of points, similarly the variance of the autoregressive residuals 



    $$ \mathrm{var}\Big( z(t) - \sum_{i=1}^{n} \alpha_i z(t-i) \Big) $$    (6.66)



is monotonically decreasing as n increases, therefore other criteria must be used to 
limit n. 



6.6.2 Noncausal Markov Model Learning 



The derivation of the Markov model is very similar to that, above, for the autore- 
gression coefficients. The reasons for this similarity should be made clear by an 
examination of Table 6.1. 

Our goal is as in the autoregressive case: we wish to find a stationary Markov model 
of specified neighbourhood which fits the given data [59-61, 199]. The noncausal 
Markov random field is specified by coefficients g as 



    $$ z_j = \sum_{i \in \mathcal{N}_j} g_{i-j} z_i + w_j, $$    (6.67)

where the driving noise w is zero mean, not white, and uncorrelated with the entire
random process $z_i$, $i \neq j$.

    Autoregressive Model                                               Noncausal MRF Model

    $z(t) = \sum_{i=1}^{n} \alpha_i z(t-i) + w(t)$    (6.54)           $z_j = \sum_{i \in \mathcal{N}_j} g_{j-i} z_i + w_j$    (6.19)

    $E[w(t) z(s)] = 0$ if $s < t$    (6.55)                            $E[z_i w_j] = 0 \;\; \forall j \neq i$    (6.19)

    $E[w(t) w(s)] = \sigma^2 \delta_{s,t}$    (6.55)                   $E[w_j w_i] = g_{i-j} \sigma^2$    (6.24)

Table 6.1. A comparison of the formulation and assumptions for stationary autoregressive
models, left, and stationary Markov random fields, right. The two models are similar, however
there are subtle differences in the noise assumptions, in that the MRF noise is correlated,
not white, and uncorrelated with the whole of the random field, rather than only with the
past of the process, as in the autoregressive case.



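Because the Markov coefficients $g_{i-j}$ affect the relationship between $z_i$ and $z_j$, it
again seems plausible to suppose that the coefficients can be estimated by examining
the correlation of z with shifted versions of itself. Since z is stationary, its statistics
can be captured by a correlation kernel

    $$ \mathcal{P}_{k-j} = E[z_j z_k] = E[z_k z_j] = \mathcal{P}_{j-k} $$    (6.68)

so by substituting (6.67) into (6.68) we get

    $$ \mathcal{P}_j = E[z_0 z_j] = E\Big[ \Big( \sum_{i \in \mathcal{N}_0} g_i z_i + w_0 \Big) z_j \Big] $$    (6.69)

    $$ \;\;\;\, = \sum_{i \in \mathcal{N}_0} g_i E[z_i z_j] + E[w_0 z_j] $$    (6.70)

    $$ \;\;\;\, = \sum_{i \in \mathcal{N}_0} g_i \mathcal{P}_{j-i} + E[w_0 z_j]. $$    (6.71)

Since $w_0$ is uncorrelated with $z_j$ for $j \neq 0$, and $E[w_0 z_0] = \sigma^2$, once again we arrive
at a set of linear equations, the analogue of the Yule-Walker
equations for the noncausal multidimensional case:

    $$ \mathcal{P}_j = \sum_{i \in \mathcal{N}_0} g_i \mathcal{P}_{j-i} + \delta_j \sigma^2. $$    (6.72)

Let $n = |\mathcal{N}_0|$ be the number of neighbours, and $(\mathcal{N}_0)_i$ the relative index of the ith
neighbour; then (6.72) can be vectorized, as before:

    $$ \begin{bmatrix} \mathcal{P}_{(\mathcal{N}_0)_1} \\ \vdots \\ \mathcal{P}_{(\mathcal{N}_0)_n} \end{bmatrix} = \underline{p} = \mathbf{P} \underline{g} = \begin{bmatrix} \mathcal{P}_{(\mathcal{N}_0)_1 - (\mathcal{N}_0)_1} & \cdots & \mathcal{P}_{(\mathcal{N}_0)_1 - (\mathcal{N}_0)_n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{(\mathcal{N}_0)_n - (\mathcal{N}_0)_1} & \cdots & \mathcal{P}_{(\mathcal{N}_0)_n - (\mathcal{N}_0)_n} \end{bmatrix} \begin{bmatrix} g_{(\mathcal{N}_0)_1} \\ \vdots \\ g_{(\mathcal{N}_0)_n} \end{bmatrix}. $$    (6.73)

Given sample data z we can compute estimates of the offset correlations

    $$ \hat{\mathcal{P}}_j = \frac{1}{N} \sum_{i=1}^{N} z_i z_{i+j} $$    (6.74)

from which we can solve for the MRF coefficients from (6.73) as

    $$ \hat{\underline{p}} = \hat{\mathbf{P}} \hat{\underline{g}} \;\Rightarrow\; \hat{\underline{g}} = \hat{\mathbf{P}}^{-1} \hat{\underline{p}} $$    (6.75)

and where the noise variance is estimated from autocorrelation offset j = 0:

    $$ \hat{\sigma}^2 = \hat{\mathcal{P}}_0 - \hat{\underline{g}}^T \hat{\underline{p}}. $$    (6.76)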

As in the autoregressive case, where the model order n could not be inferred, similarly
for Markov random fields it is not possible to infer the neighbourhood $\mathcal{N}$. The
choice of $\mathcal{N}$ must either be asserted, or we can try to look at the model error

    $$ \mathrm{var}\Big( z_j - \sum_{i \in \mathcal{N}_j} g_{i-j} z_i \Big) $$    (6.77)



as a function of neighbourhood size, and choose the smallest neighbourhood which 
gives an acceptable level of error. Example 6.3 illustrates the process of Markov 
model estimation. 
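
The learning equations can be verified on a case where the answer is known. The sketch below makes several simplifying assumptions which are not part of the text: the field is one-dimensional and periodic with 32 sites, the true model is $z_j = a z_{j-1} + a z_{j+1} + w_j$ with a = 0.3 and $\sigma^2$ = 2, and the exact correlation kernel (computed from cov(z) = $\sigma^2 G^{-1}$ in (6.32), using the eigenvalues of the circulant G) is used in place of the sample estimate. The function names are hypothetical.

```python
import math


def ring_covariance_kernel(a, sigma2, n):
    """Exact correlation kernel of the periodic model, from cov(z) = sigma2 * G^{-1}.

    G is circulant with 1 on the diagonal and -a at offsets +/-1, so its
    eigenvalues are 1 - 2 a cos(2 pi k / n).
    """
    lam = [1.0 - 2.0 * a * math.cos(2.0 * math.pi * k / n) for k in range(n)]
    return [sigma2 / n * sum(math.cos(2.0 * math.pi * k * j / n) / lam[k]
                             for k in range(n))
            for j in range(n)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def learn_mrf(P, offsets):
    """Solve the vectorized system (6.73) for g, then the noise variance as in (6.76)."""
    n = len(P)

    def corr(j):                       # periodic lookup of the kernel
        return P[j % n]

    p_vec = [corr(o) for o in offsets]
    P_mat = [[corr(r - c) for c in offsets] for r in offsets]
    g_hat = solve(P_mat, p_vec)
    sigma2_hat = corr(0) - sum(g * p for g, p in zip(g_hat, p_vec))
    return g_hat, sigma2_hat


P = ring_covariance_kernel(a=0.3, sigma2=2.0, n=32)
g_hat, sigma2_hat = learn_mrf(P, offsets=[-1, 1])
print(g_hat, sigma2_hat)
```

With exact statistics and the correct neighbourhood the model is recovered to numerical precision; with sample statistics, as in (6.74), the same equations are solved but the result inherits the sampling error.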
Example 6.3: Markov Model Inference and Model Order

The images below illustrate the learning of MRF models for three different prior
models.

Given each of the three textures from Example 6.1, an MRF prior model is learned
for each of four different neighbourhood orders. The following image panels show
the sample texture synthesized from each of the learned models:

[Image panels: textures synthesized from the learned models, in columns Order 0, Order 1, Order 2, Order 3.]



By definition, the zero-order models have no spatial relations, and therefore are 
all white noise. 

The true membrane prior is first order, and is therefore learned adequately well 
with a first-order model, with no substantial changes at higher orders. 

The true thin-plate prior is third order, and the inadequacy of the first- and second- 
order reconstructions can clearly be seen. 

Finally the true "Tree-Bark" model is fourth order and poorly conditioned, and so 
is not properly learned above, although the third-order model is able to represent 
the vertical banding. 



Example continues ...

Example 6.3: Markov Model Inference and Model Order (cont'd)

The following two figures plot the model error variance (6.77) as a function of
model order and the size of the sample image from which to learn the model:

[Plots: Thin-Plate and Tree Bark; horizontal axis MRF Model Order; curves for sample images of size 16x16, 128x128, and 2048x2048.]



We can see how misleading the model error variance (6.77) can be as an indicator 
of model accuracy. Going from a zeroth- to first-order model, the thin-plate MSE 
dropped from 1.0 to 0.01, and the "Tree-Bark" MSE from 1.0 to 0.001, however 
in both cases the first-order reconstructions are quite poor. 

Instead, we could consider looking at the covariance of the learned model, and 
compare that to the covariance of the underlying true model. The following figures 
plot the difference in the learned and underlying covariance kernels, based on a 
weighted sum of absolute kernel differences: 



[Plots: Thin-Plate and Tree Bark; horizontal axis MRF Model Order.]



We can now clearly see the failure of the first-order models to learn the correct
field statistics, most strikingly in the thin-plate case, and also how a relatively
large sample image is required before we can reliably claim to have learned the true
model.

Example 6.4: Model Inference and Conditioning

We saw in Example 6.3 how the assumed model order affected the quality of
the inferred result. However, presumably the ability to learn a model might also
depend on the conditioning of the true model, in that a prior with high condition
number $\kappa$ requires the Markov model G to be learned just right, a hair's-breadth
away from singularity, whereas a model with low condition number would be
more tolerant of parameter variations.

Consider the following, in which three thin-plate models are learned having different
correlation lengths (and condition number) based on the value of their center
element. All of the sample images are 128 x 128 in size:

[Image panels: three thin-plate cases ordered from short correlation length to Long Correlation (Large $\kappa$); the surviving centre-element labels are 20.1 and 20.01.]



We can clearly see the reduction in performance as we move from well condi- 
tioned (short correlation length, left) to poorly conditioned (right). 

The plots on the facing page summarize this, plotting the probability of finding 
a positive-definite model as a function of prior model correlation length and the 
given size of the sample image: 



Example continues ...

Example 6.4: Model Inference and Conditioning (cont'd)

It is important to recognize that a set of model parameters g which fails to be
positive-definite is not inherently useless. These model parameters could still be 
used as a feature for texture discrimination in pattern recognition, or texture seg- 
mentation in image processing, for example. However, as a statistical prior such 
a model would have no validity. 



6.7 Choices of Representation 

The parallels between Gibbs and Gauss-Markov fields, and the related parallels with 
the models introduced in Chapter 5, motivate a renewed examination of the question 
of representation, following the discussion of Section 5.8. 

As is illustrated in Figure 6.13, we have a total of twelve options, laid out along three 
independent axes: 

1. Specifying constraints directly or in squared form; 

2. Specifying a convolutional kernel versus a full form; 

3. Selecting a deterministic model versus a regular or inverse statistical one. 



The comparative tradeoffs and issues are mostly the same as in Section 5.8, and
do not need repeating here. What is new is the set of Gibbs-Markov models
$(\mathcal{G}, G, \mathcal{V}, V)$, which neatly straddle deterministic models $(\mathcal{Q}, Q, \mathcal{L}, L)$ and the
statistical dynamic models $(\mathcal{P}, P, \mathcal{A}, A)$:

Fig. 6.13. Building on Figure 5.19, we now have twelve forms of representation, left, where
the forms differ based on the choice of three underlying aspects: representing a model or its
square root, full or convolutional-kernel, and deterministic or statistical modelling. The middle
set of inverse models $(\mathcal{G}, \mathcal{V}, G, V)$ are those contributed by random field modelling.



1. Interpreted as inverse covariances, the Gibbs-Markov models have a straightforward
interpretation as representing the inverse statistics of a covariance or dynamic
model.

2. Viewed as equivalent, by analogy, with the deterministic constraint models, the
Gibbs-Markov models give a clear, statistical interpretation to local/sparse models
which were previously much less explicit.

3. The Gibbs model extends the previous dynamic $(\mathcal{A}, A)$ and constraint $(\mathcal{L}, L)$ formulations
by allowing more general classes of models, including discrete-valued
states and non-quadratic error criteria.



Application 6: Texture Classification 



The problem of texture classification is exceptionally well studied, and enjoys a lit- 
erature spanning more than two decades [60, 225, 226, 241, 242, 258-260]. 

Suppose that we have a number of textures, for example the five samples from the 
widely-studied Brodatz set [45] shown in Figure 6.14.

We assume these textures to be known, making this a supervised classification prob- 
lem [91]. We can learn a Markov random field model for each of these textures, 
as was done in Example 6.3. Then, given an image to segment into its constituent 
textures, we can attempt to relate the unknown image to the learned models. 

There are many ways in which such a classification can be undertaken, however let's 
consider a very basic approach. From each texture image I t we extract a number 



Texture Classification 
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Fig. 6.14. Given five sample Brodatz textures [45], top, we wish to use a random field model 
to segment a given composite image, below. 



of samples {S ty i} of size TV x N, and a Markov model g(S) of order q is learned 
from each sample. Then, given an unknown sample S, we learn its model g(S), and 
classify the texture 

t(S) = *ig t mhi\\g(§),{g(St t i)}\\ (6.78) 

where the norm || • || needs to assess the closeness of the unknown model g to the 
learned sets. The simplest norm is a Euclidean nearest-neighbour [91] approach, 
treating the Markov model as a vector of GMRF coefficients: 



$$\hat{t}(S) = \arg_t \min \Bigl\{ \min_i \bigl| g(S) - g(S_{t,i}) \bigr|^2 \Bigr\}. \qquad (6.79)$$
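The nearest-neighbour rule of (6.79) can be sketched in a few lines. This is only an illustration under stated assumptions: each learned model is taken to be a plain list of GMRF coefficients, and the names `nearest_texture` and `library`, together with the coefficient values, are hypothetical.

```python
def nearest_texture(g_unknown, library):
    """Return the texture label t minimizing min_i |g(S) - g(S_{t,i})|^2.

    g_unknown : list of floats, coefficients learned from the unknown patch S
    library   : dict mapping a texture label t to a list of coefficient
                vectors, one per training sample S_{t,i}
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two coefficient vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_label, best_dist = None, float("inf")
    for label, models in library.items():
        d = min(sq_dist(g_unknown, g) for g in models)  # inner min over i
        if d < best_dist:                               # outer min over t
            best_label, best_dist = label, d
    return best_label


# Hypothetical coefficient vectors: two textures, two training samples each.
library = {
    "bark":  [[0.40, 0.10], [0.38, 0.12]],
    "straw": [[0.05, 0.45], [0.07, 0.43]],
}
label = nearest_texture([0.36, 0.15], library)
```

In practice the coefficient vectors would come from a model-learning step such as that of Example 6.3; here they are simply typed in.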



Figure 6.15 plots the probability of a correct texture match, as a function of the
texture patch size $N$, and as a function of the size of the Markov model neighbourhood.
Clearly the probability of a correct match increases with $N$, since a larger sample
more distinctly characterizes a given texture; however, the lower performance with
higher model order may be surprising: although a higher model order does represent
a given texture better, the increased number of parameters in the model leads to much
greater variability in $g$ and a poorer classification.

[Figure 6.15: probability of a correct match (vertical axis) plotted against image patch
size (horizontal axis), with one curve for each of the 3x3, 5x5, 7x7, and 9x9
neighbourhoods.]

Fig. 6.15. The probability of matching a texture sample of size $N \times N$ to its correct class,
for four Markov neighbourhood sizes. More local models and larger texture patches lead to
improved matching probability.

[Figure 6.16: two panels, (a) Segmented Results and (b) Median Filtered.]

Fig. 6.16. The segmented composite texture of Figure 6.14. When each texture patch is
segmented individually, left, there are no spatial constraints and the resulting classification is
somewhat noisy. A median filter can be used to assert smoothness in the classification, right.

Selecting a patch size of 30 x 30 and a second-order (3 x 3 neighbourhood) Markov
model, with the nearest-neighbour classifier of (6.79), leads to the results as shown
in Figure 6.16(a). Because each texture patch is processed independently of the others,
there is no spatial constraint and the estimates appear noisy. Essentially what we
would like here is a prior model on the underlying texture label; such an underlying
prior model is a hidden model, the subject of Chapter 7. A very simple spatial
constraint is asserted by a median filter [54], which generates the imperfect, but credible,
results shown in Figure 6.16(b).
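To make the median-filter step concrete, the following is a minimal sketch rather than the implementation used to produce Figure 6.16. It assumes integer texture labels stored as a list of lists, a square window truncated at the image border, and the lower median for windows holding an even number of labels; the name `median_filter_labels` and the small test map are hypothetical.

```python
from statistics import median_low


def median_filter_labels(labels, radius=1):
    """Replace each label by the median of its (2*radius+1)^2 window.

    The window is truncated at the image border; median_low keeps the
    result an actual label value when the window size is even.
    """
    rows, cols = len(labels), len(labels[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                labels[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = median_low(window)
    return out


# A small label map with one isolated misclassification (the 3).
noisy = [
    [1, 1, 1, 2, 2],
    [1, 3, 1, 2, 2],
    [1, 1, 1, 2, 2],
    [1, 1, 2, 2, 2],
]
smoothed = median_filter_labels(noisy)
```

The isolated label is removed while the boundary between the two larger regions is left in place, which is the behaviour wanted from a simple spatial constraint.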




Summary 

Markovianity means a conditional separation, the separation of two parts by some
boundary:

1D: For one-dimensional problems, one or more values of the random process form
the boundary separating past from future;

dD: For multidimensional problems, it is easier to talk about the boundary required
to separate a single pixel (inside) from the entire rest of the domain (outside).

In separating a single pixel from the rest of the domain, the boundary which effects
this separation is the neighbourhood $\mathcal{N}$. The size of the neighbourhood is the model
order. There are two classes of models:

1. Causal / acyclic models, in which the state elements can be ordered, leading to 
efficient algorithms but relatively poor models. 

2. Noncausal / loopy models, in which there is no ordering, leading to greater com- 
plexity, but also better modelling. Nearly all of the examples in the text are drawn 
from the noncausal case. 

If a Markov field obeys Gaussian statistics, we have the important class of
Gauss-Markov Random Fields (GMRFs),

$$z_j = \sum_{k \in \mathcal{N}_j} g_{j,k} z_k + w_j \qquad E[z_k w_j] = 0 \;\; \forall\, j \neq k, \qquad (6.80)$$

characterized by model parameters $g$. If the random field is stacked into a vector, and
the model parameters similarly into a matrix, then (6.80) is neatly rewritten as

$$Gz = w \qquad E[z w^T] \text{ diagonal.} \qquad (6.81)$$

The correlation structure of the noise is determined by $G$,

$$\mathrm{cov}(w) \Longleftrightarrow G \qquad (6.82)$$

and the model $G$ is interpreted as an inverse covariance:

$$\mathrm{cov}(z) \Longleftrightarrow G^{-1}, \qquad (6.83)$$

so that assuming an inverse covariance to be sparse and banded is equivalent to
assuming that the random field is Markov.
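The two correspondences (6.82) and (6.83) follow in two lines from (6.81). As a sketch, assume a zero-mean field, a symmetric and invertible $G$, and the scaled form $E[zw^T] = \sigma^2 I$ of the diagonal condition (the normalization that appears in (6.84)); $P$ denotes $\mathrm{cov}(z)$:

```latex
\begin{align*}
\mathrm{cov}(w) &= E[w w^T] = E[G z\, w^T] = G\, E[z w^T] = \sigma^2 G, \\
\sigma^2 I &= E[z w^T] = E[z z^T G^T] = P\, G^T
  \;\Longrightarrow\; P = \sigma^2 G^{-T} = \sigma^2 G^{-1}.
\end{align*}
```

Under these assumptions the noise covariance is proportional to $G$ rather than to the identity, so the driving noise can be white only if $G$ is itself diagonal.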

The four essential facts of GMRFs: 
1. The GMRF model parameters are essentially the matrix-inverse of the covariance.
Therefore the matrix-inverse of a Markov covariance will be highly sparse.

2. The driving noise process $w$ is not white. If we assert that the noise is uncorrelated
with the random field,

$$E[z w^T] = \sigma^2 I \qquad (6.84)$$

then it is not possible to also assert that the noise is white.

3. The scale of the random field is controlled by the model parameters g. 

4. The complexity of the random field is controlled by the model order. 



For Further Study 



The question of what happens to Markov random fields when cast into a hierarchical 
context is looked at in Section 8.5. 

Markov fields are often used in pairs, with a hidden or underlying field serving to 
condition a second, visible field. The use of and modelling with hidden fields will be 
discussed in Chapter 7. 

The fastest and easiest way to get started with random fields is by using fast Fourier
transforms (FFTs). The connection between Markov random fields and FFTs is
explored in Section 8.3. Because Markov random fields imply a sparse $P^{-1}$, they are
compatible with iterative approaches to estimation, which are developed in Chapter 9.

The texts by Winkler [335], Li and Gray [205], and Won and Gray [336] all compre- 
hensively discuss random fields, with Won and Gray focusing on Markov fields, and 
Winkler more on Gibbs fields. 



Sample Problems 



Problem 6.1: Markov Order 

(a) For each of the panels in Figure 6.4, what is the maximum order of the cor- 
responding Markov field if the field can be separated by the given boundary? 

(b) Construct three 2D boundaries, like the ones drawn in Figure 6.4, such that 
the maximum order of the corresponding Markov field is second-, third-, and 
fourth-order. 
Problem 6.2: Membrane, Thin-Plate, and Markovianity

Below are two figures, reproduced from Example 2.7, back on page 36:

[Figure: two panels, labelled "Small" and "Large".]
where we generated estimation results for one-dimensional processes given first- 
or second-order constraints. 

(a) Why is the first-order constraint Markov? What order of Markovianity does 
it possess? 

(b) Why is the second-order constraint Markov? What order of Markovianity 
does it possess? 

(c) The first-order results look piecewise-linear, leading to two questions: 

• Explain convincingly why the piecewise-linearity is due to first-order 
Markovianity. 

• Does this mean that the estimate at a point depends only on the two closest 
measurements, or not? 

(d) The figure below shows a two-dimensional first-order estimate, based on the 
measurements and plotting arrangement as in Example 5.1: 




Clearly these estimates are not piecewise-linear, although the prior is still 
first-order Markov. What is it about the difference between one- and two- 
dimensional problems that leads to the difference in behaviour? 
Problem 6.3: Markov Model Inference and Conditioning

We know two things: 

1. The worse the conditioning of a problem, the more sensitive the resulting 
estimates can be to model errors. 

2. The smaller a given sample random field, the greater the errors in the inferred 
MRF model. 

Therefore we might expect there to be an interrelationship between kernel con- 
ditioning and the amount of data needed to learn the kernel model. 

Consider the thin-plate kernel in Example 6.1, where we set the central element
of the kernel to one of four possible values: 20.1, 20.03, 20.01, 20.003. The
conditioning of the kernel becomes worse as this central element is decreased
(the kernel becomes singular, corresponding to a condition number of $\infty$, when
the central element is set to 20).

You will need to use the FFT method of Section 8.3, or access the sample Mat- 
lab code available online, in order to synthesize the sample random fields. 

For each of the four kernels: 

(a) Generate a random N x N sample, using the FFT, for some choice of N, 
using the true kernel. 

(b) Learn a fourth-order MRF model from the $N \times N$ sample.

(c) Now generate a random 128 x 128 sample, using the FFT with the kernel 
just learned. 

(d) Repeat the above three steps, experimenting with different choices of $N$,
seeing what $N$ is required to generate reasonable results in (c).

Comment on your observations. 

Problem 6.4: Open-Ended Real-Data Problem — Texture Classification 

There is an extensive literature [60, 62, 226, 260] on the use of Markov random 
fields in texture classification. Select an MRF-based method and apply it to a 
standard texture database, such as the Brodatz, CUReT, UIUC, or KTH-TIPS 
databases, all of which can be found on the Internet. 

Compare the classification accuracy which you obtain using a Markov approach 
with the published Markov methods [60, 62, 226, 260], and then also with more 
recently-published approaches, for example those based on local binary patterns 
[242] and local image patches [315]. 



Hidden Markov Models 



Working with nonstationary and heterogeneous random fields presents an interesting 
challenge. In principle, the modelling methods of Chapter 5 do allow us to construct 
a nonstationary model, under the assumption that the model boundaries are known. 

For example, given a noisy image $M$ that we wish to segment into its piecewise-constant
parts, we could attempt to learn a nonstationary model $L_{NS}$ from edge detection
on $M$, and then estimate $\hat{Z}$ based on model $L_{NS}$. It is then possible that
the pattern of nonstationarity will be more evident in $\hat{Z}$ than in $M$, leading to an
alternating iterative method:

$$M \;\overset{\text{Infer}}{\Longrightarrow}\; L_{NS}^{(1)} \;\overset{\text{Estimate}}{\Longrightarrow}\; \hat{Z}_1 \;\overset{\text{Infer}}{\Longrightarrow}\; L_{NS}^{(2)} \;\overset{\text{Estimate}}{\Longrightarrow}\; \hat{Z}_2 \;\cdots \qquad (7.1)$$

The approach of (7.1) is actually reasonably close to the method we eventually
propose for such a problem. However, rather than a heuristic guess at $L_{NS}$, we should
prefer a more systematic formulation.

Consider the two panels in Figure 7.1. The left panel shows a multi-texture image,
such that each of the individual textures (thin-plate, wood-grain) is Markov;¹ however,
the resulting image is not, since a pixel at the boundary between two textures
needs more than just a local neighbourhood to understand its context. Similarly the
right panel shows a two-scale image, which superimposes a fine-scale (thin-plate) 
texture on a large-scale binary field. 

The key idea behind hidden Markov modelling is that many complex images, scenes, 
and phenomena can be modelled as combinations of simpler pieces. Although the 
details are yet to be defined, the essence of the approach is illustrated graphically in 
Figure 7.2. 



¹ This chapter assumes an understanding of Markovianity and random fields from Chapter 6.

[Figure 7.1: two panels, "A nonstationary field" (left) and "A two-scale problem" (right).]

Fig. 7.1. Two examples of complex, nonstationary behaviour. How do we construct models
for such fields? Are these even Markov?



7.1 Hidden Markov Models 



Hidden Markov Models (HMMs) [26, 127, 205-207, 265] have a long history in 
stochastic signal processing, most significantly in the speech analysis literature. The 
attractiveness of HMMs stems from their intuitiveness and tractability in formulating 
complex inverse problems. 

We start with a single random field $Z$, as shown in the top panel of Figure 7.3.
For modelling simplicity, we assert that $Z$ is stationary, Markov, and has a local
neighbourhood. The local neighbourhood must therefore be capable of decoupling
a pixel $z_i$ from the rest of the domain. It is consequently not possible for $Z$ to have
structure on more than one scale, since a local neighbourhood cannot simultaneously
discriminate two separate scales.

To induce more complex structure, we therefore need multiple random fields, com- 
bined in some way. The combining will be developed through the following four 
examples. 



7.1.1 Image Denoising 



If we have a Markov field $z$ and add noise $v$ to it, the observed sum

$$m = z + v \qquad z \sim \text{Markov} \qquad v \sim \text{White} \qquad (7.2)$$

falls directly into the context of linear inverse problems discussed throughout
Chapters 2 and 3.

[Figure 7.2: a complex (nonstationary, non-Markov) field, left, decomposed into two simple
(stationary, Markov) fields, right.]

Fig. 7.2. The essential premise of hidden modelling is to decompose a complex problem, left,
into multiple simple pieces, right. The non-Markovian statistics, left, become Markovian by
conditioning on a hidden or underlying field.



Although we know the analytical solution to (7.2), let us deliberately reformulate the
problem in terms of a conditional density:

$$\begin{aligned}
p(m, z) &= p(m|z) \cdot p(z) && (7.3) \\
        &= \underbrace{\Bigl[\prod_i p(m_i|z_i)\Bigr]}_{\text{Independent}} \cdot \underbrace{p(z)}_{\text{Markov}} && (7.4)
\end{aligned}$$

That is, we have an underlying hidden Markov field $z$, which makes the observed
field $m$ conditionally independent, as sketched in the middle of Figure 7.3.

[Figure 7.3: three modelling structures, each drawn as a random field mesh, left, with a
simplified diagram, right:

I. A single field, left, may be independent, discrete-state Markov, or continuous-state Markov,
respectively.

II. Given the hidden field (o), the observed field (•) is conditionally independent.

III. A generalization of case II, such that the visible, conditional field is Markov, rather than
independent.]

Fig. 7.3. A progression in complexity of three modelling structures, from a single field (I), to
a conditionally independent field (II), to a conditionally Markov field (III). The random field
meshes are sketched in two dimensions, but apply equally well to one- or three-dimensional
processes. Since such sketches become cumbersome for large or complex problems, we
summarize each structure with a simplified diagram, right, where nodes are conditional on all
connected nodes from below.
[Figure 7.4: three panels, Texture Denoising (Section 7.1.1), Image Segmentation
(Section 7.1.2), and Texture Segmentation (Section 7.1.3), with layers labelled
"Textures" and "Texture Labels".]

Fig. 7.4. Three illustrations of two-layer hidden Markov models. In each case, a more
complex scene (top) is simplified by absorbing one component of its structure into a hidden layer
(bottom). Conditioned on the hidden layer, the top image becomes independent (left, centre)
or Markov (right).



7.1.2 Image Segmentation 



In image segmentation [35, 36, 54, 205], we wish to take a given, possibly noisy, 
image M and partition it into piecewise-constant regions. In practice, segmentation 
may involve colour or multispectral images, or the segmentation of multiple textures 
(Figure 7.4); however, for the purpose of this example, let us assume M to be a noisy, 
piecewise-constant grey scale image, as illustrated in Figure 7.4. 
The underlying (hidden) label field $U$ is now discrete, such that $u_i = j \in \Psi$ asserts
that pixel $m_i$ belongs to the $j$th region. The forward model is very similar to (7.2)
for denoising:

$$m = f(u) + v \qquad u \sim |\Psi|\text{-state Markov} \qquad v \sim \text{White} \qquad (7.5)$$

such that $f(j)$ identifies the grey-shade of region $j$. The conditional formulation
proceeds as before:

$$p(m, u) = \underbrace{\Bigl[\prod_i p(m_i|u_i)\Bigr]}_{\text{Independent}} \cdot \underbrace{p(u)}_{\text{Markov}}. \qquad (7.6)$$

Because $U$ is now a discrete-state field, we need to select an appropriate discrete
prior model, a few of which are discussed in Section 7.4.

The strength of this approach is that we have explicitly separate prior models for the
visible image $p(m|u)$ and for the quite different behaviour of the underlying label
field $p(u)$.
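The separation into a data term and a label prior is easy to see once the joint density of (7.6) is written as an energy, the negative logarithm up to a constant. The sketch below makes several assumptions not fixed by the text: the noise is Gaussian with variance `sigma2`, the label prior is a Potts-style penalty of weight `beta` on neighbouring labels that disagree (Section 7.4.1), and the grey shades `f` are known. The function name and test values are hypothetical.

```python
def joint_energy(m, u, f, sigma2=1.0, beta=1.0):
    """Energy -log p(m, u), up to a constant, for the model of (7.6).

    m : observed image, list of lists of floats
    u : label field of the same shape
    f : dict mapping a label j to its grey shade f(j)
    """
    rows, cols = len(m), len(m[0])
    data = 0.0   # independent term: sum_i -log p(m_i | u_i)
    prior = 0.0  # Markov term: -log p(u), Potts-style disagreement count
    for r in range(rows):
        for c in range(cols):
            data += (m[r][c] - f[u[r][c]]) ** 2 / (2.0 * sigma2)
            if r + 1 < rows and u[r][c] != u[r + 1][c]:
                prior += beta
            if c + 1 < cols and u[r][c] != u[r][c + 1]:
                prior += beta
    return data + prior


f = {0: 0.0, 1: 10.0}               # hypothetical grey shades
m = [[0.5, 9.0], [-0.2, 10.4]]      # noisy two-region image
u_good = [[0, 1], [0, 1]]           # labels matching the image
u_bad = [[1, 0], [0, 1]]            # labels contradicting it
e_good = joint_energy(m, u_good, f)
e_bad = joint_energy(m, u_bad, f)
```

A segmentation method would search for the label field of lowest energy; here the two candidate fields are simply compared.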

One obvious limitation to (7.6) is that the image model is simplistic, with additive 
white noise on piecewise-constant regions. In practice we are interested in segment- 
ing more subtle behaviour, such as different texture models, addressed next in Sec- 
tion 7.1.3. 

A second limitation is that the association $f()$ between region label and image value
is assumed to be known ahead of time. In practice this association is found via a
clustering method, such as K-means [91]. An alternative approach, based on a hidden
edge model, is discussed in Section 7.1.4.



7.1.3 Texture Segmentation 

We are now given an image $M$ composed of multiple textures, as shown in
Figure 7.1, which we would like to segment or decompose [225, 226, 260]. $M$ consists
of $K$ Markov models, where the choice of model is determined by the $K$-state hidden
field $U$,

$$u_i \in \Psi = \{1, \ldots, K\} \qquad (7.7)$$

as illustrated in Figure 7.4.

The reader will find the generalization of (7.6) to the texture segmentation context
quite straightforward by now:

$$p(m, u) = \underbrace{p(m|u)}_{\text{Markov}} \cdot \underbrace{p(u)}_{\text{Markov}}. \qquad (7.8)$$
Fig. 7.5. A possible hidden model for edge detection. A hidden layer consists of binary edge
elements u (rectangles, top), declaring whether or not two adjacent pixels (•) in z are separated. 
The hidden layer u requires a prior model, bottom, where edge terminations and junctions are 
rare, relative to continuing segments or no edge at all. 



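In the case where the observed image is noisy, an additional layer is added, such that
we have hidden layers of label $U$ and denoised texture $Z$:

$$p(m, z, u) = \underbrace{\Bigl[\prod_i p(m_i|z_i)\Bigr]}_{\text{Independent}} \cdot \underbrace{p(z|u)}_{\text{Markov}} \cdot \underbrace{p(u)}_{\text{Markov}}. \qquad (7.9)$$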

Indeed, the attractiveness of the hidden Markov approach is how simple and intuitive 
such decompositions can become. 



7.1.4 Edge Detection 



Suppose we wish to segment an image, as in Section 7.1.2, but without assuming
knowledge of the relationship $f()$ in (7.5) between the image and the hidden labels.

The alternative to a hidden label, identifying what region a pixel belongs to, is a
hidden binary state $u^{|}_{x,y}$, lying between adjacent pixels in $z$, declaring whether a
region boundary is present [127], as sketched in Figure 7.5:

$$u^{|}_{x,y} = \begin{cases}
0 \;\Longrightarrow\; \text{No edge is present} \;\Longrightarrow\; z_{x,y}, z_{x+1,y} \text{ constrained to be similar.} \\
1 \;\Longrightarrow\; \text{An edge is present} \;\Longrightarrow\; z_{x,y}, z_{x+1,y} \text{ decoupled, no constraint.}
\end{cases} \qquad (7.10)$$

Explicit two-dimensional indices are used here, to simplify the notation relating $u$
and $z$.

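If the observed image is noisy, then we have the familiar three-layer hidden Markov
model

$$p(m, z, u) = \underbrace{\Bigl[\prod_{x,y} p(m_{x,y}|z_{x,y})\Bigr]}_{\text{Independent}} \cdot \underbrace{p(z|u)}_{\text{Markov}} \cdot \underbrace{p(u)}_{\text{2-Markov}}. \qquad (7.11)$$

The model requires that we specify a prior $p(u)$ for the binary edge field. In the
absence of other knowledge, this prior is most easily specified via the probabilities
of local edge groups, as shown in Figure 7.5. The resulting model is most easily
written as a Gibbs energy (see Section 6.5):

$$\begin{aligned}
H(m, z, u) = {} & \underbrace{\sum_{x,y} (m_{x,y} - z_{x,y})^2}_{\text{Measurement Noise}}
 + \underbrace{\sum_{x,y} (1 - u^{|}_{x,y})(z_{x,y} - z_{x+1,y})^2}_{\text{Vertical Edges}} \\
 & + \underbrace{\sum_{x,y} (1 - u^{-}_{x,y})(z_{x,y} - z_{x,y+1})^2}_{\text{Horizontal Edges}}
 + \underbrace{\sum_{x,y} H\bigl(u^{|}_{x,y}, u^{-}_{x,y}, u^{-}_{x+1,y}, u^{|}_{x,y+1}\bigr)}_{\text{Edge Prior}}
\end{aligned} \qquad (7.12)$$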

In practice, (7.12) may be a bit simplistic, and we may instead wish to assert non- 
quadratic penalties between pixels, and to weight differently the various parts of the 
energy function. 
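An energy of the form (7.12) is straightforward to evaluate for a candidate triple of fields. The sketch below rests on assumptions: the edge field is stored as two binary arrays, `u_vert[x][y]` between pixels (x, y) and (x+1, y) and `u_horz[x][y]` between (x, y) and (x, y+1); the edge prior is a user-supplied function of the four edge elements meeting at a pixel corner; and border terms are simply skipped. All names are hypothetical.

```python
def edge_energy(m, z, u_vert, u_horz, edge_prior=lambda *edges: 0.0):
    """Gibbs energy of the hidden-edge model, in the spirit of (7.12).

    m, z   : lists indexed [x][y], measurements and estimated image
    u_vert : u_vert[x][y] = 1 decouples z[x][y] from z[x+1][y]
    u_horz : u_horz[x][y] = 1 decouples z[x][y] from z[x][y+1]
    edge_prior : function of the four edge elements around one corner
    """
    nx, ny = len(z), len(z[0])
    h = 0.0
    for x in range(nx):
        for y in range(ny):
            h += (m[x][y] - z[x][y]) ** 2                # measurement noise
            if x + 1 < nx:                               # vertical edges
                h += (1 - u_vert[x][y]) * (z[x][y] - z[x + 1][y]) ** 2
            if y + 1 < ny:                               # horizontal edges
                h += (1 - u_horz[x][y]) * (z[x][y] - z[x][y + 1]) ** 2
            if x + 1 < nx and y + 1 < ny:                # edge prior
                h += edge_prior(u_vert[x][y], u_horz[x][y],
                                u_horz[x + 1][y], u_vert[x][y + 1])
    return h


z = [[0.0, 0.0], [10.0, 10.0]]   # a step between x = 0 and x = 1
m = z                            # noise-free measurements, for simplicity
no_edges = [[0, 0], [0, 0]]
cut = [[1, 1], [0, 0]]           # edge elements along the step

h_smooth = edge_energy(m, z, no_edges, no_edges)
h_cut = edge_energy(m, z, cut, no_edges)
h_cut_priced = edge_energy(m, z, cut, no_edges,
                           edge_prior=lambda *edges: float(sum(edges)))
```

Declaring an edge removes the large smoothness penalty across the step; a prior that charges for each active edge element then prevents edges from being declared everywhere.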



7.2 Classes of Joint Markov Models 

A very wide variety of hidden Markov models has been proposed for image mod- 
elling and analysis, as summarized in Figure 7.6. The actual labels and categories 
are relatively unimportant: when faced with a spatial modelling problem, no attempt 
should be made to fit the problem into one of these predefined categories. It is far 
preferable to examine the hidden behaviour present in a problem, and to propose a 
structure reflecting the behaviour which is actually present. 

The quick survey presented here is primarily to expose the reader to the commonly 
employed models, with citations to the literature for further reading: 

• Markov Models [86, 127] are discussed in detail in Chapter 6. In particular, a 
single, stationary Gauss-Markov random field is described by a kernel, and is 
able to model textures and patterns having a single characteristic scale. 
Example 7.1: Multiple Hidden Fields



There is no significant distinction between having one or multiple hidden fields: 
for each distinct attribute or behaviour appearing in an image, we require a sepa- 
rate hidden field to control the attribute's presence or absence. 

For example, the four regions in the relatively complex image below are described
in terms of two attributes: circle size and circle density. We can therefore construct
a hidden system consisting of two discrete-state hidden fields. The top image is
Markov only when conditioned on both of the hidden fields.

[Figure: the image and its two hidden fields, labelled "Scale" and "Density".]

Clearly an additional attribute, such as colour, would just lead to an additional 
hidden field. 

Similarly if the attributes take on more than two possibilities, such as having cir- 
cles of three different sizes, then the scale hidden field becomes ternary, rather 
than binary. 
Fig. 7.6. A representative sample of the common hidden Markov models proposed for image
analysis and modelling. The difference between the Double and Pairwise models is that the 
two fields in the Double model are Markov and conditionally Markov, whereas in the Pairwise 
model the fields are jointly Markov, but neither field is Markov on its own. 



• Hidden Markov Models [26, 127, 205-207] have an underlying Markov field $Z$,
such that the visible field $M|Z$ is conditionally independent. This narrow
interpretation of a hidden Markov model essentially applies only to noisy Markov fields,
so for most problems of multidimensional modelling more substantial structures
are required.

• The Double Model [230] allows the conditional field to be Markov, allowing for 
hidden labels and visible textures, as in Section 7.1.3. The model can clearly be 
generalized to allow additional layers. 

• The Factorial Model [193] allows for a greater number of conditioning fields, as 
in Example 7.1, where a given field depends on multiple factors, each of which 
may be independently modelled as a random field. 
Fig. 7.7. A hidden Markov tree: The hidden binary state under each wavelet coefficient
declares whether the corresponding coefficient is of small or large variance. Although the
structure appears relatively complex, it is actually acyclic (no loops), so inference and estimation
can be done very efficiently.



• The joint models, such as the Pairwise [260] and Triplet [21] models, give
increased model flexibility but at a substantial computational cost. A pair of fields
$(Z, U)$ being jointly Markov implies that $(Z|U)$, $(U|Z)$ are both conditionally
Markov; however, neither $Z$ nor $U$ is Markov on its own.

Similarly a triple $(M, Z, U)$ being jointly Markov implies that doubly-conditioned
fields $(M|Z, U)$ are conditionally Markov, but singly-conditioned $(Z|U)$, $(U|Z)$
and unconditioned $Z$, $U$ are not.

• A great many specialized models have been proposed, including the edge model
[127] of Section 7.1.4, multimodels [22], and wavelet models [77, 274]. As
multidimensional spatial operators, wavelets (discussed in Chapter 8) are of particular
interest. The hidden Markov tree, illustrated in Figure 7.7, uses a binary hidden
state to model the observation that the variance of child coefficients tends to match
the variance of the parent.

• Spatial hidden Markov models are part of a much larger literature on complex 
and distributed Bayesian inference, including graphical models [180], Bayesian 
networks [177, 249], and hierarchical hidden Markov models [116]. 



7.3 Conditional Random Fields 

The models we have seen thus far are generative: a joint distribution 

$$p(Z, U) = p(Z|U)\, p(U) \qquad (7.13)$$

describes all of the statistics of Z, meaning that we can generate or draw samples of 
Z. 
Although the explicit goal of this text is the development of multidimensional
generative models, let us briefly consider a different perspective, which is that for certain
problems the generative model says much more than we need. Particularly for
labelling or segmentation problems, if we are given an image $M$ from which we wish
to infer labels $U$, we do not need to be able to describe the statistics of $M$. $M$ is
given; we only need to be able to describe the dependence

$$p(U|M) \qquad (7.14)$$

of $U$ on $M$. This is a conditional or discriminative model, able to discriminate
between possible label fields $U_1, U_2, \ldots$ but without any implied model for $M$.

The generative models of this chapter, such as in (7.6), were simplified by assuming 
the conditioned distribution to be Markov or independent. Consider the following: 

$$\underbrace{p(M|U) \rightarrow \prod_i p(m_i|U)}_{\text{Generative}} \qquad \underbrace{p(U|M) \rightarrow \prod_i p(u_i|M)}_{\text{Discriminative}}. \qquad (7.15)$$

We observe:

• The generative model describes a single image pixel $m_i$ on the basis of a nonlocal
set of labels,

• The discriminative model describes a single label $u_i$ on the basis of a nonlocal
portion of the image.

So in direct contrast to the generative approach, discriminative methods are attractive
because the image $M$ is not modelled, typically leading to models having fewer
parameters and increased robustness, and because the single label $u_i$ is conditioned
or described in terms of nonlocal portions of the image. The only aspect which is
missing is some form of interdependence between label values, so we would prefer
a discriminative model along the lines of

$$p(U|M) \rightarrow \prod_i p(u_i \,|\, u_{\mathcal{N}_i}, M) \qquad (7.16)$$

over some neighbourhood $\mathcal{N}_i$.

One approach to discriminative modelling is that of conditional random fields [198,
295], in which the conditional distribution is written as a Gibbs model (Section 6.5)

$$p(U|M) = \frac{1}{\mathcal{Z}} \exp\Bigl( \sum_i f_1(u_i, M) + f_2(u_i, u_{i-1}, M) \Bigr). \qquad (7.17)$$

The labels $\{u_i\}$ are assumed to be sequentially ordered, and the dependence of $U$ on
$M$ is described by feature functions $f_1$ and $f_2$, the single-site and transition features,
respectively.
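For a very short chain the conditional distribution of (7.17) can be evaluated exactly by enumerating every label sequence, which makes the role of the normalization constant explicit. This is a sketch only: brute-force enumeration is exponential in the chain length, the feature functions `f1` and `f2` below are invented for illustration, and `f1` is passed the site index so that it can look up the relevant part of M.

```python
import itertools
import math


def crf_chain(m, labels, f1, f2):
    """Return {label sequence: p(U|M)} for the chain model of (7.17)."""
    n = len(m)
    scores = {}
    for u in itertools.product(labels, repeat=n):
        s = sum(f1(u[i], i, m) for i in range(n))            # single-site
        s += sum(f2(u[i], u[i - 1], m) for i in range(1, n))  # transition
        scores[u] = math.exp(s)
    z = sum(scores.values())  # normalization constant
    return {u: v / z for u, v in scores.items()}


def f1(ui, i, m):
    # Hypothetical single-site feature: label should resemble the data.
    return -(m[i] - ui) ** 2


def f2(ui, u_prev, m):
    # Hypothetical transition feature: reward for repeating a label.
    return 0.5 if ui == u_prev else 0.0


p = crf_chain([0.1, 0.9, 0.8], (0, 1), f1, f2)
best = max(p, key=p.get)
```

Note that nothing in the code describes the statistics of the data `m`; only the dependence of the labels on `m` is modelled.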
It is, of course, possible to generalize conditional random fields to multidimensional
problems. We have already repeatedly encountered this question, in the context of
coupling and complexity in Section 5.2, and in the context of one-dimensional versus
multidimensional random fields in Chapter 6. The short summary is that sequential
models are efficient but tend to lead to poorer models, so non-sequential iterated
approaches are preferred, based on some sort of neighbourhood, as in

$$p(U|M) = \frac{1}{\mathcal{Z}} \exp\Bigl( \sum_i f_1(u_i, M) + f_2(u_i, u_{\mathcal{N}_i}, M) \Bigr). \qquad (7.18)$$

In addition to not modelling $M$, the great strength of the conditional random field
approach is the use of a Gibbs framework, which allows great flexibility in modelling,
since nonlinear/discrete-state models are easily accommodated and there are no
issues of normalization or positive-definiteness. For example, the image segmentation
method in [157] is based on a conditional random field model

p(U\M) = — exp I ^2 fi( u h M , d) + h(ui, UtfL**, 0) + fo(ui, u^-uk* , 6) J 

(7.19) 
with unknown model parameters $\theta$ to be learned. Conditional random fields have
seen growing application in computer vision, to problems of labelling, segmentation,
and object tracking [157, 196, 327].



7.4 Discrete-State Models 



In many cases the hidden field is a discrete label, such as the high/low variance of
a wavelet coefficient, the region in an image, the identity of a texture, or the
classification of a pattern. In any such case, the hidden field will require a discrete-state
model. Although most discrete-state models are fairly simple and intuitive, there are
two key difficulties:

1. There are surprisingly few models to choose from, and of these even fewer that
actually provide anything but the most trivial structure.

2. As soon as a discrete-state model is included, the overall problem becomes, by
definition, a discontinuous function of the state, and therefore nonlinear.

It is because of nonlinearity that the majority of this text has focused on the
continuous-state case; nevertheless, the importance of hidden Markov models
motivates at least an overview of discrete-state methods.
Fig. 7.8. The two-dimensional Ising model: three random samples are shown, for different
values of inverse-temperature parameter $\beta$. As $\beta$ is increased, the coupling between pixels is
more strongly asserted, and the prior states a greater preference for homogeneous regions.



7.4.1 Local Gibbs Models 

The Ising model [26, 171, 335] is the most famous of all discrete-state models.
Originally developed as a model of ferromagnetism in statistical mechanics, its early fame
stemmed from the fact that the two-dimensional problem was solved, analytically, by
Onsager in 1944.

The regular Ising model, with no external field, is a simple, first-order Markov model:

$$H(U) = -\sum_{i,j} \bigl( u_{i,j} u_{i+1,j} + u_{i,j} u_{i,j+1} \bigr) \qquad u_{i,j} \in \{-1, 1\} = \Psi. \qquad (7.20)$$

Recall from (6.39) in Chapter 6 the relationship between energy and probability
density for Gibbs random fields:

$$p(u) = \frac{1}{\mathcal{Z}} e^{-\beta H(u)}. \qquad (7.21)$$

Thus a lower (more negative) energy $H$ is associated with a higher probability
density; that is, the Ising model states a preference for regions of constant state value in
$U$.

Three random samples of $u$, drawn from $p(u)$ in (7.21), are shown in Figure 7.8.
The Ising prior is a function of only a single parameter, the inverse-temperature $\beta$ in
(7.21). Correspondingly, the Ising model is a very weak prior, capable only of stating
a vague preference for large blob-like objects.
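Samples such as those described above can be drawn approximately with a single-site Gibbs sampler, a standard technique that this section does not itself describe. The sketch below assumes free (non-periodic) boundaries and a fixed number of sweeps, so the result is only an approximate draw from (7.21); the function names and default values are hypothetical. Each site is resampled from its conditional probability given its four neighbours, $P(u_{i,j} = +1) = 1 / (1 + e^{-2\beta s})$ with $s$ the sum of the neighbouring states.

```python
import math
import random


def ising_energy(u):
    """Energy H(U) of (7.20), with free (non-periodic) boundaries."""
    rows, cols = len(u), len(u[0])
    h = 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                h -= u[i][j] * u[i + 1][j]
            if j + 1 < cols:
                h -= u[i][j] * u[i][j + 1]
    return h


def gibbs_sample_ising(rows, cols, beta, sweeps=50, seed=0):
    """Approximate sample from p(u) of (7.21) by single-site Gibbs sweeps."""
    rng = random.Random(seed)
    u = [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                s = 0  # sum of the neighbouring states
                if i > 0:
                    s += u[i - 1][j]
                if i + 1 < rows:
                    s += u[i + 1][j]
                if j > 0:
                    s += u[i][j - 1]
                if j + 1 < cols:
                    s += u[i][j + 1]
                # Conditional probability that this site equals +1.
                p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * s))
                u[i][j] = 1 if rng.random() < p_plus else -1
    return u


sample = gibbs_sample_ising(16, 16, beta=0.6)
```

Raising `beta` strengthens the coupling between neighbours, so the sampled fields show progressively larger homogeneous regions.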

The Potts model [26, 258, 335] is the generalization of the Ising model to the $K$-ary
case, for states taking on one of $K > 2$ discrete values:

$$H(U) = -\sum_{i,j} \delta_{u_{i,j},\,u_{i,j-1}} - \sum_{i,j} \delta_{u_{i,j},\,u_{i-1,j}} \qquad u_{i,j} \in \{1, 2, \ldots, K\} = \Psi. \qquad (7.22)$$

Qualitatively, the behaviour of the Potts model is very similar to that of Ising: it is
a single-parameter first-order local prior, preferring large homogeneous regions.



7.4.2 Nonlocal Statistical-Target Models 

The strength of the Gibbs approach to statistical modelling lies in the assertion of
intuitive energy functions. That is, unlike covariance matrices (which need to satisfy
subtle positive-definiteness requirements), the energy function has only one
requirement:

Assign lower values to "better" states, and larger values to "poorer" states.

One of the simplest and most common approaches is to define one or more attributes
$A_q(U)$, with idealized target values learned from training data $\bar{U}$,

$$T_q = A_q(\bar{U}) \qquad (7.23)$$

such that the energy function is then expressed as the degree of inconsistency
between the attributes of a given state $U$ and the target:

$$H(U) = \| T - A(U) \|. \qquad (7.24)$$

In the simplest case, we just penalize the squared difference between attribute and
target:

$$H(U) = \sum_q \bigl( T_q - A_q(U) \bigr)^2. \qquad (7.25)$$

Any feature which can be written as a function of a discrete-state field can
therefore form the basis for a Gibbs energy. Common features include the presence in $U$
of horizontal, vertical, or diagonal lines of length $q$, the two-point correlation as a
function of offset $q$, the fraction of states having a given state value, or many other
morphological features.

The Chordlength model [312] characterizes a random field on the basis of the
distribution of black or white chords. Define $A^d_{l,v}(U)$ to be the number of chords present
in $U$ of length $l$ and state value $v$ in direction $d$. Then the energy comparing a given
field $U$ to some ground truth $\bar{U}$ is

$$H(U) = \sum_{d,v,l} \bigl( A^d_{l,v}(\bar{U}) - A^d_{l,v}(U) \bigr)^2. \qquad (7.26)$$
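The chordlength attributes and the resulting energy are simple to compute for a small binary field. The sketch below assumes that a chord is a maximal run of equal state values along a row or a column, so only the horizontal and vertical directions are counted; the function names and the two test fields are hypothetical.

```python
from collections import Counter


def chord_counts(u):
    """Count chords of a binary field, keyed by (direction, value, length).

    A chord is taken here to be a maximal run of equal values along a
    row ("h") or a column ("v").
    """
    counts = Counter()

    def scan(line, direction):
        start = 0
        for k in range(1, len(line) + 1):
            # A run ends at the end of the line or where the value changes.
            if k == len(line) or line[k] != line[start]:
                counts[(direction, line[start], k - start)] += 1
                start = k

    for row in u:
        scan(row, "h")
    for col in zip(*u):
        scan(col, "v")
    return counts


def chord_energy(u, truth):
    """Sum of squared differences between chord counts, as in (7.26)."""
    a, t = chord_counts(u), chord_counts(truth)
    return sum((t[key] - a[key]) ** 2 for key in set(a) | set(t))


truth = [[0, 0, 1],
         [0, 0, 1]]
candidate = [[0, 1, 1],
             [0, 0, 1]]
energy = chord_energy(candidate, truth)
```

Because each count depends on runs of arbitrary length, changing a single pixel can alter attributes far from its immediate neighbourhood, which is what makes the model nonlocal.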

Related is the Correlation model [312], which characterizes a random field based 
on autocorrelation. Clearly a variety of alternatives is present, such as whether the 
correlation is specified separately in the horizontal and vertical directions, or a single 
isotropic model, and also the maximum offset to which the correlation is asserted. 
Fig. 7.9. Two possible energy targets for a binary random field: chordlengths (middle) and
correlation (right). Observe how the model reveals the presence of large black pores via a
heightened target on long black chords in the middle panel.

Because the extracted attributes $A(\cdot)$ are nonlocal in support, both the correlation and
chordlength are global models, distinctly different from the local Markov models
seen throughout Chapter 6. Figure 7.9 illustrates extracted target distributions for a
sample binary random field of a porous medium.



7.4.3 Local Joint Models 



The strength of the chordlength and correlation models is that they are global and 
are able to describe large-scale phenomena, such as the large black pores seen in 
Figure 7.9. On the other hand, the weakness of these nonlocal models is that they are 
poor at capturing and describing local, fine-scale, detailed morphology. 

As an alternative, local joint models over a small window of state elements have 
been proposed. Although Local Binary Patterns (LBP) [241,242] have been used 
most frequently as features in pattern recognition problems, such as texture classifi- 
cation, as image attributes they are equally applicable to being asserted in an energy 
function (7.25). In their most basic form, LBP features are based on the eight pixels 
surrounding a central one, are rotation invariant, and are based on those nine patterns, 
shown in Figure 7.10, empirically deemed to be the most significant. 

Very similar is the Local Histogram model [5], which preserves the joint probability
of each of the $2^9$ configurations of binary elements in a $3 \times 3$ window, leading to
energy

$$ H(U) = \sum_{q=0}^{2^9-1} \frac{\bigl( P_q(U) - P_q(\breve{U}) \bigr)^2}{\sigma_q + \epsilon} \tag{7.27} $$

where $\breve{U}$ denotes the ground truth, $\sigma_q$ represents the variability, from sample
to sample, of the probability $P_q$ of configuration $q$, and small $\epsilon > 0$ is chosen to
avoid a division by zero for impossible (zero probability) configurations. In practice,
the number of configurations may be reduced from $2^9$ by asserting reflection / rotation
invariance.

Fig. 7.10. In contrast to the non-local Gibbs models of Figure 7.9, here two local models are
proposed, based on Local Binary Patterns (top), or an exhaustive set of joint combinations
(bottom). If a prior model with substantial structure and morphology is required, these local
models are best combined with nonlocal ones, or used in a hierarchical setting.
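A small Python sketch of the local histogram energy follows. It assumes the weighted form of (7.27) with $\sigma_q + \epsilon$ in the denominator, encodes each $3 \times 3$ window as a 9-bit integer, and ignores border pixels; the encoding order and the uniform `sigma` used in the example are illustrative assumptions, since in practice $\sigma_q$ would be estimated from several training samples.

```python
def config_index(field, r, c):
    """Encode the 3x3 binary window centred at (r, c) as an integer in 0..511."""
    q = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            q = (q << 1) | field[r + dr][c + dc]
    return q

def config_probabilities(field):
    """Empirical probability P_q of each of the 2^9 window configurations."""
    rows, cols = len(field), len(field[0])
    n = (rows - 2) * (cols - 2)          # number of interior window positions
    p = [0.0] * 512
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            p[config_index(field, r, c)] += 1.0 / n
    return p

def histogram_energy(field, truth, sigma, eps=1e-6):
    """Weighted squared difference of configuration probabilities, as in (7.27)."""
    p, t = config_probabilities(field), config_probabilities(truth)
    return sum((pq - tq) ** 2 / (sq + eps) for pq, tq, sq in zip(p, t, sigma))

zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
sigma = [1.0] * 512                      # hypothetical per-configuration variability
print(histogram_energy(zeros, zeros, sigma))
print(histogram_energy(zeros, ones, sigma, eps=0.0))
```

The two constant fields place all of their probability on different configurations (index 0 and index 511), so the second energy is the sum of two unit penalties.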

The flexibility of the Gibbs energy approach allows new energy functions to be proposed
at will. For example, because the LBP and histogram models are local, there
may be some merit to combining them with a nonlocal model such as chordlength:

$$ H_{\text{Combined}}(U) = H_{\text{Chordlength}}(U) + \alpha H_{\text{Histogram}}(U). \tag{7.28} $$



7.5 Model Determination 

Whereas Section 5.7 examined the question of modelling, in general, and Section 6.6 
in the context of Markov fields, here we consider modelling issues for hidden fields. 

If there is only a single, fixed prior model for the hidden field, then given a model
and ground-truth data for $U$ we can infer the model parameters $\theta$, as was done in
Figure 7.9.

On the other hand, if the model parameters $\theta$ may vary with and be a function of
image $Z$, then $\theta$ needs to be learned in each context. Furthermore, since the hidden
field is, well, hidden, it is not available for parameter learning, and $\theta$ must be learned
from the visible field $Z$ or its measurements $M$.

The difficulty, as discussed in Section 5.7, is that without knowing $U$, it is exceptionally
difficult to formulate the maximum-likelihood estimation problem

$$ \hat\theta = \arg_\theta \max p(Z \mid \theta) \quad\text{or}\quad \hat\theta = \arg_\theta \max p(M \mid \theta) \tag{7.29} $$

since the influence of $\theta$ on $Z, M$ is felt via $U$, whereas the hidden maximum-likelihood
problem

$$ \hat\theta = \arg_\theta \max p(U \mid \theta) \tag{7.30} $$

is normally very easy, if $U$ were known. The basic idea, then, is a straightforward
extension of the above insight, which is to alternate between estimating $\theta$ and $U$,
as shown in Algorithm 2.

Algorithm 2 Simplified Expectation Maximization

Goals: Estimate the parameters $\theta$ for hidden field $U$
Function $[\hat U, \hat\theta] = \mathrm{EM}(Z)$
Initialize $\hat\theta$
while not converged do
    E-Step: Given model parameters $\hat\theta$, compute the hidden estimates $\hat U \leftarrow E[U \mid Z, \hat\theta]$.
    M-Step: Given hidden estimates $\hat U$, compute ML estimates $\hat\theta \leftarrow \arg_\theta \max p(\hat U \mid \theta)$.
end while

This is a somewhat simplified description of what is known as the EM
(Expectation-Maximization) algorithm [25,85,275], which is near-universally used in
hidden Markov models.² Slightly more precise than Algorithm 2,
the EM algorithm consists of two steps:

E-Step: Compute the expectation $E_u\bigl[ \log p(z, u \mid \theta) \mid z, \hat\theta_i \bigr]$
M-Step: Find $\hat\theta_{i+1}$ to maximize the expectation $E$

The EM algorithm is nothing more than the iteration of these two steps; EM does 
not tell you how to compute these steps. Indeed, given that the E-Step is essentially 
an estimation problem, implementing the EM method for a large, multidimensional 
problem will rely on the estimation methods of this text to compute the estimates. 
The latter M-Step is normally a comparatively straightforward ML problem. 

Although the EM algorithm cannot guarantee convergence of $\hat\theta$ to the ML estimate,
the likelihood of $\hat\theta$ is guaranteed not to decrease with any iteration, meaning that EM
converges, in general, only to a local maximum (or other stationary point) of the likelihood.
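To make the alternation concrete, here is a minimal sketch of EM for a deliberately simple hidden model: each observation comes from one of two Gaussian classes with equal weights and known standard deviation, the hidden variable is the class indicator, and the parameters are the two class means. The model, the function name `em_two_means`, and the data are assumptions chosen for illustration; they are not a random-field example.

```python
import math

def em_two_means(z, mu, sigma=1.0, iterations=50):
    """EM for a two-class 1D Gaussian mixture (equal weights, known sigma).

    z: observed samples; mu: initial guesses (mu0, mu1).
    Returns (responsibilities for class 1, estimated (mu0, mu1)).
    """
    mu0, mu1 = mu
    resp = []
    for _ in range(iterations):
        # E-step: expected value of each hidden indicator, P(class 1 | z_i, mu)
        resp = []
        for zi in z:
            l0 = math.exp(-0.5 * ((zi - mu0) / sigma) ** 2)
            l1 = math.exp(-0.5 * ((zi - mu1) / sigma) ** 2)
            resp.append(l1 / (l0 + l1))
        # M-step: ML estimates of the means given the expected indicators
        w1 = sum(resp)
        w0 = len(z) - w1
        mu0 = sum((1.0 - r) * zi for r, zi in zip(resp, z)) / w0
        mu1 = sum(r * zi for r, zi in zip(resp, z)) / w1
    return resp, (mu0, mu1)

z = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
resp, (mu0, mu1) = em_two_means(z, mu=(-1.0, 1.0))
print(mu0, mu1)   # close to -2 and +2
```

For a spatial hidden field the E-step is no longer a per-sample formula but a full estimation problem, which is exactly why the estimation methods of the text are needed there.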

For acyclic HMMs, in which the statistical dependencies have no loops (Figure 5.1),
efficient EM implementations exist based on the forward-backward algorithm [265,275]
for sequential problems, and a very similar upward-downward algorithm on
trees [77,275].
Fig. 7.11. Two standard test images, "Peppers" and "House". The images are plotted in
greyscale, however the underlying images are actually in colour.



Application 7: Image Segmentation 



Let us continue to explore the image segmentation problem, as sketched in Figure 7.4:

Given an image $I$, we wish to partition the image into $K$ segments.

Because each image pixel $I_{x,y}$ is placed into one of $K$ discrete segments, we clearly
have a hidden label field $U_{x,y} \in \{1, \ldots, K\} = \Psi$, describing the partition of each
associated pixel.

The goal of this application is to illustrate how the prior model for U affects the 
segmentation result. We will demonstrate segmentation on the two images of Fig- 
ure 7.11. 



No Prior Model 



We begin with no prior model for $U$, such that nothing is known about the spatial
properties of $U$, and the segmentation problem is non-Bayesian. To group the elements
of $I$ into one of $K$ groups, on the basis of the colour of each pixel, is known
as a clustering problem:

$$ [\hat c, \hat U] = \arg_{c,U} \min \sum_{x,y} \bigl\| I_{x,y} - c_{U_{x,y}} \bigr\| \tag{7.31} $$

2 In the context of hidden Markov models the EM algorithm is also known as the
Baum-Welch algorithm.

[Figure 7.12 panels: K-means segmentation (top); K-means + Potts prior segmentation (bottom)]

Fig. 7.12. Image segmentation based on a global K-means clustering, with (bottom) and without
(top) a spatial prior model on the hidden label field.



A very standard solution to clustering is the K-means algorithm [91], for which the
segmentation is shown in Figure 7.12. Because each cluster center $c_q$ is a global
parameter, all parts of the image are segmented consistently. However, the absence
of any spatial prior means that adjacent pixels are in no way constrained, and so we
see the creation of large numbers of single-pixel regions at the interface between two
segments.
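A minimal sketch of the K-means iteration follows, simplified to scalar (greyscale) intensities in a flattened list rather than colour vectors; the function name `kmeans`, the initial centres, and the data are illustrative assumptions.

```python
def kmeans(pixels, centres, iterations=20):
    """Plain K-means on scalar intensities; returns (centres, labels)."""
    centres = list(centres)
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: each pixel takes the label of its nearest centre
        labels = [min(range(len(centres)), key=lambda k: abs(p - centres[k]))
                  for p in pixels]
        # Update step: each centre becomes the mean of its member pixels
        for k in range(len(centres)):
            members = [p for p, l in zip(pixels, labels) if l == k]
            if members:
                centres[k] = sum(members) / len(members)
    return centres, labels

pixels = [10, 12, 11, 200, 198, 202, 12, 199]
centres, labels = kmeans(pixels, centres=(0, 255))
print(centres)   # [11.25, 199.75]
print(labels)    # [0, 0, 0, 1, 1, 1, 0, 1]
```

Note that the pixel positions never enter the computation: each label depends only on the pixel value and the global centres, which is why isolated single-pixel regions can appear.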
K-Means with Potts Prior

Given the K-means constraint of (7.31), it is straightforward to modify the criterion
to include a spatial constraint, such as a Potts prior, on the hidden label field $U$:

$$ \hat U = \arg_U \min \sum_{x,y} \| I_{x,y} - c_{U_{x,y}} \| + \mathrm{Potts}(U) \tag{7.32} $$

$$ \phantom{\hat U} = \arg_U \min \sum_{x,y} \| I_{x,y} - c_{U_{x,y}} \| + \sum_{x,y} \bigl( \delta_{U_{x,y},U_{x,y-1}} + \delta_{U_{x,y},U_{x-1,y}} \bigr) \tag{7.33} $$

where the Potts prior is taken from (7.22).

Adding such a prior asserts a preference for spatially homogeneous regions, with the
consequence that single-pixel regions are eliminated, leading to a clean, robust
segmentation, as seen in Figure 7.12.
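The effect of the spatial term can be sketched with a simple greedy optimizer. The code below uses iterated conditional modes (ICM), which is a swapped-in technique chosen for brevity and not necessarily the optimizer behind Figure 7.12. It assumes fixed cluster centres, scalar intensities, and a Potts term written as a penalty `beta` for every unequal four-neighbour pair, which is one common sign convention and may differ from that of (7.22).

```python
def icm_potts(image, centres, beta, sweeps=5):
    """Greedy labelling: data misfit plus beta per disagreeing 4-neighbour."""
    rows, cols = len(image), len(image[0])
    # Start from the prior-free (nearest-centre) labelling
    labels = [[min(range(len(centres)), key=lambda k: abs(image[r][c] - centres[k]))
               for c in range(cols)] for r in range(rows)]
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                nbrs = [labels[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < rows and 0 <= cc < cols]

                def cost(k):
                    data = abs(image[r][c] - centres[k])        # K-means term
                    prior = beta * sum(k != n for n in nbrs)    # Potts-style penalty
                    return data + prior

                labels[r][c] = min(range(len(centres)), key=cost)
    return labels

image = [[10, 10, 10],
         [10, 120, 10],
         [10, 10, 10]]
print(icm_potts(image, centres=[10, 200], beta=0))    # centre pixel keeps label 1
print(icm_potts(image, centres=[10, 200], beta=10))   # centre pixel is absorbed
```

With `beta=0` the ambiguous centre pixel is assigned to the bright cluster on intensity alone; with `beta=10` the cost of disagreeing with four neighbours outweighs the better data fit, and the single-pixel region disappears.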



Potts Prior 

The K-means approaches of Figure 7.12 have a global criterion, in that a single set
of target colours $\{c_q\}$ applies to the entire image. This requires, however, that the
image be well described in terms of a fixed set of colours and that $K$ be known.
Counterexamples would be an image having regions of smoothly varying colour,
or an image with multiple regions of very similar colour which are meant to be
separately segmented.

As an alternative we can consider a local model, such that the criterion is written
entirely in terms of the local behaviour of both image and hidden layer. For example,
we can assert a Potts prior, and penalize the difference in adjacent image pixels if
they belong to the same region:

$$ \hat U = \arg_U \min \sum_{x,y} \Bigl[ \delta_{U_{x,y},U_{x,y-1}} \bigl( \| I_{x,y} - I_{x,y-1} \| - \gamma \bigr) + \delta_{U_{x,y},U_{x-1,y}} \bigl( \| I_{x,y} - I_{x-1,y} \| - \gamma \bigr) \Bigr], \qquad U_{x,y} \in \Psi, \tag{7.34} $$

where constant $\gamma$ controls the degree to which the Potts prior is asserted.

Although K must still be specified, because of the locality of the model, K no longer 
limits the total number of distinct segments, as in (7.33). Rather, K now asserts how 
many different regions may touch; a larger K gives more flexibility to the model, but 
also (typically) increases complexity. 

As discussed further in Chapter 8, because the model is local it is difficult to form
large regions: the number of iterations needed is quadratic in region size or extent.
That is, for an $N \times N$ image, with regions a fixed fraction of the height or width of
the image, the total complexity is

$$ \text{Complexity per Iteration} \cdot \text{Number of Iterations} = O(N^2 K) \cdot O(N^2) = O(K N^4). \tag{7.35} $$

The top panel in Figure 7.13 makes this clear: after more than 100 iterations, there are
a great many small regions present. The formation of large regions is exceptionally
slow, even in very simple, smooth regions such as the sky of the "House" image.

[Figure 7.13 panels: Potts Prior (top); Potts Prior, Repeatedly Aggregated (bottom)]

Fig. 7.13. Image segmentation using only local models, in contrast to the global constraints
imposed in Figure 7.12. Because single pixels are flipped, one at a time, it is exceptionally
difficult to converge to large homogeneous regions, top. By flipping regions, rather than pixels,
much larger regions can readily be formed.

It is possible to take these large numbers of small regions and create metrics [217]
for grouping regions. Similarly we can treat each region as a single hidden element,
such that our energy function is written in terms of regions [329], rather than pixels,
in which case (7.34) becomes

$$ \hat U = \arg_U \min \sum_j \sum_{i \in \mathcal{N}_j} \delta_{U_i,U_j} \bigl( \| I_i - I_j \| - \gamma \bigr), \qquad U_i \in \Psi, \tag{7.36} $$

where $\mathcal{N}_j$ represents the set of regions adjacent to region $j$.

By doing such grouping repeatedly, as discussed in Section 8.6, the overall con- 
vergence is accelerated greatly, leading to the segmentation in the bottom panel of 
Figure 7.13. Because of the absence of a global criterion, we do see many more local 
regions here, relative to the global segmentation in Figure 7.12. 



For Further Study 



The Viterbi method of Section 4.5.1 generates the estimated state sequence for hid- 
den Markov chains; generalizations to hidden random fields are discussed in Chap- 
ter 11. 

The paper by Geman & Geman [127] is strongly recommended, as understanding 
their edge model, discussed in Section 7.1.4, will leave the reader with a good un- 
derstanding of the use of hidden models with random fields. 

The text by Li and Gray [205] is recommended for readers wishing to understand 
hidden models more deeply. 

The Baum-Welch algorithm [19] is widely used to learn the parameters in HMMs.
For discrete-state HMMs, the Forward-Backward algorithm [52, 96, 205] is a standard,
iterative approach for state estimation.

To experiment in MATLAB, readers may be interested in the Image Processing tool- 
box, the Hidden Markov Model toolbox, or the Conditional Random Fields toolbox. 



Sample Problems 



Problem 7.1: Hidden Models for Simple Binary Fields 

For the two-phase random field in Figure 7.14, qualitatively sketch the associated 
hidden model along the lines of what is shown in Example 7.1: 

(a) What sort of visual behaviour does the hidden state describe? 

(b) Is the hidden state discrete or continuous? If discrete, what order is it (binary, 
ternary etc.)? 
[Figure 7.14 panels: left (see Problem 7.1); right (see Problem 7.2)]

Fig. 7.14. Two sample binary random fields. A two-phase example, left, and a more complex
structure, right.



(c) Is the hidden state Markov? Why / why not? 

(d) Is the visible state conditionally Markov or conditionally independent? 

Problem 7.2: Hidden Models for Complex Binary Fields 

The binary random field shown in the right panel of Figure 7.14 shows multiple 
sorts of structures: pores, regions of high density, and a background region of 
lower density. Repeat Problem 7.1 for this image. 

Problem 7.3: Pairwise Random Fields 

Suppose that X, Y are jointly Markov, two-dimensional, and Gaussian. 

(a) Write the explicit statement of Markovianity for the joint pair (X, Y), along 
the lines of (6.13). 

(b) Prove that $(X \mid Y)$ and $(Y \mid X)$ are Markov.

Problem 7.4: Triplet Random Fields 



Suppose that X, Y, Z are jointly Markov, two-dimensional, and Gaussian. 

(a) Write the explicit statement of Markovianity for the joint triplet (X, Y, Z), 
along the lines of (6.13). 

(b) Prove that $(X \mid Y, Z)$ is Markov.

(c) Argue why or in what circumstances $(X \mid Y)$ is not Markov.
Problem 7.5: Three-Dimensional Random Fields

(a) Describe briefly the similarities and differences between

(i) A 3D Markov random field, and

(ii) A hidden chain of equally-sized 2D Markov random fields, linked via
conditioning, like a long version of the Generalized Double model in
Figure 7.6.

(b) Prove that a chain of equally-sized 2D Markov random fields satisfies the
conditions of 3D Markovianity.

(c) Suggest a problem that might be more easily represented as multiple 2D 
fields rather than a single 3D field. 

(d) Under what circumstances might a single 3D field be more convenient than 
a hidden chain of 2D fields? 

Problem 7.6: Open-Ended Real-Data Problem — Segmentation 

There is a very large literature on image segmentation [36, 54, 258], with a subset 
specialized to the use of random fields in segmentation [35,226], and finally more 
specifically to the use of hidden/pairwise/triplet random fields [21,205,230,260]. 
Select one or more hidden-Markov segmentation methods, and apply the selected 
methods to standard segmentation test images. 

Compare your approach with those of non-Markov methods, such as those based on
watershed or region growing.



8 

Changes of Basis 



The previous chapters have focused on the definition of a model, and corresponding 
estimator or sampler, for some random vector z. Explicit throughout the preceding 
chapters has been the assumption that z contains a set of spatial elements or image 
pixels; that is, that z represents the raw, underlying random field of interest. 

However, a given model for $z$ clearly implies a model for any linear function of $z$:

$$ z \sim P \quad\longrightarrow\quad \bar z = F z \quad\longrightarrow\quad \bar z \sim F P F^T \tag{8.1} $$

where, if $F \neq I$, the elements of $\bar z$ are no longer image pixels, rather some linear
function, possibly local or nonlocal, of the original random field. Furthermore, if
$F \neq I$ is square and invertible, then the transformation from $z$ to $\bar z$ is referred to as
a change of basis.

Clearly what is lost in such a change is the simple, intuitive understanding of the 
state elements as image pixels. It is also possible that certain desirable properties 
of the model may be lost; in particular, most models increase in density (become 
less sparse) through a change of basis, and models which are stationary or Markov 
normally become nonstationary or non-Markov [197, 199,254]. Nevertheless there 
is much to be gained in terms of numerical robustness and computational efficiency. 

In particular, the poor conditioning of many estimation problems, especially in our
context of random fields, stems from the locality of most Markov and deterministic
constraint models (e.g., Figures 5.10 and 5.11 for deterministic kernels Q, and
Example 6.1 on page 190 for Markov kernels Q). Figure 8.1 illustrates the problem: for
a local operator to assert a statistical relationship between distantly separated pixels
requires information to be passed indirectly through many state elements. If $z_1$ and
$z_6$ in Figure 8.1 are uncorrelated then there is no statistical relationship to assert,
and the problem may be well conditioned; however if $z_1$ and $z_6$ are very tightly
constrained then we have a very strict assertion, but one which needs to be inferred
implicitly from the repeated application of a model. Such a tight correlation would be
the case with a smooth spatial correlation, such as a Gaussian, which explains much of
the conditioning behaviour in Table 5.2 (page 163).

[Figure 8.1: a measurement of $z_1$ and its influence along a chain of state elements $z_1, z_2, \ldots, z_6$]

Fig. 8.1. Local models lead to ill-conditioning because of indirection: given a measurement of
element $z_1$ and a first-order prior model, we need to infer $z_2$ from $z_1$, then $z_3$ from $z_2$ and so
on. The greater the number of levels of indirection, the more poorly conditioned is the system
matrix and the more slowly a corresponding iterative solver converges.

The indirection of statistical assertions and information flow can be clearly seen in 
Figure 8.2, which plots the iterative solution (discussed in Chapter 9) to a linear 
system. At the measured locations the state elements respond strongly to the mea- 
surement, however state elements some distance away from the measurements have 
only barely responded, and it will require a great many iterations before converging. 
The key idea in this chapter is that a change of basis, coupling non-neighbouring 
state elements, might accelerate this convergence. 
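The slow propagation of information can be reproduced with a very small experiment. The sketch below assumes a one-dimensional chain with a first-order (neighbour-difference) prior, a single exact measurement at the first element, and Gauss-Seidel sweeps as the iterative solver; these choices and the name `gauss_seidel_chain` are assumptions for illustration and are not the setup used to generate Figure 8.2. The exact solution of this toy problem is constant and equal to the measurement.

```python
def gauss_seidel_chain(n, measurement, sweeps):
    """Gauss-Seidel sweeps for a first-order prior on a chain, z[0] measured exactly."""
    z = [0.0] * n
    z[0] = measurement
    for _ in range(sweeps):
        for i in range(1, n - 1):
            # Each interior element is pulled to the average of its two neighbours
            z[i] = 0.5 * (z[i - 1] + z[i + 1])
        # The free end has a single neighbour and simply copies it
        z[n - 1] = z[n - 2]
    return z

results = {s: gauss_seidel_chain(50, 1.0, s) for s in (10, 100, 1000)}
for s, z in results.items():
    print(s, z[1], z[25], z[49])
```

Elements next to the measurement respond within a few sweeps, whereas the far end of the chain is still well short of the exact value of one after a thousand sweeps, mirroring the indirection argument above.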

The fundamental mathematical approach to improving conditioning is to affect the 
distribution of eigenvalues in the system, however in most cases the details of the 
eigendistribution are unknown for a given inverse problem. Instead, in practice the 
desire to improve conditioning amounts to reducing the degree of indirect statistical 
assertions between the elements of z, for which there are three basic approaches, all 
of which are explored in this chapter: 

1. Reduce the Strength of Statistical Assertions: 

That is, find a change of basis to accomplish some degree of state decorrelation. 
The method of principal components is based on this idea. 

2. Make the Problem Smaller:

That is, find a reduction of basis, whether by subsampling or by local methods
working on subsets of the random field, such that the number of state elements is
reduced.

3. Make the Model Nonlocal, Reducing Indirection: 

That is, introduce a change of basis in which the basis elements are nonlocal, 
allowing spatially separated state elements to be coupled, reducing the number of 
steps of interaction. 
Fig. 8.2. We can see the indirect nature of the local system, discussed in Figure 8.1, by
examining an iterative solution to a simple, first-order interpolation problem after 10, 100, and
1000 iterations. Near the measurements the state elements are strongly pulled towards the
measurement, however the information provided by the measurement decays exponentially as
we move away from the measured locations.



Because of their computational benefits, changes of basis take place implicitly in 
many efficient algorithms, as will be seen in Chapter 9. However the focus of this 
chapter continues to be modelling: how can we use basis changes to change a prob- 
lem, possibly to reduce it in size, to decouple it, or to restructure it in some fashion? 



The chapter begins with an overview of the mathematics of basis changes, followed 
by three approaches for basis reduction, and finally ending with three approaches to 
computationally efficient basis changes. 



8.1 Change of Basis 



A large number of approaches to basis change are possible, however they can be 
divided into two general approaches: 



1. Explicit changes, in which the entire problem, including the measurements, is 
transformed into a new domain. All aspects of statistical processing — modelling, 
estimation, prior / posterior sampling — can be undertaken in the transformed 
domain. 
2. Implicit changes, in which an estimation problem is restructured such that the
solution is found in the transformed domain, and then projected back into the
original. Although somewhat simpler than an explicit change, implicit changes
of basis apply only to estimation.



Explicit Basis Change 

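Consider a change of basis, as in (8.1). We transform a problem, solve it in the
transformed domain, from which the solution is projected back to the original setting:

$$ \begin{array}{ccc}
\text{Original Domain} & & \text{Transformed Domain} \\[4pt]
z \sim P, \quad m = Cz + v & \xrightarrow{\;F\;} & \bar z \sim \bar P, \quad \bar m = \bar C \bar z + \bar v \\
 & & \Big\downarrow \text{Estimation} \\
\hat z \sim \tilde P & \xleftarrow{\;F^{-1}\;} & \hat{\bar z} \sim \tilde{\bar P}
\end{array} \tag{8.2} $$

where the transformed problem is given by

$$ \bar z = F z \sim \bar P = F P F^T \tag{8.3} $$

$$ \bar m = F m = F C F^{-1} \bar z + F v = \bar C \bar z + \bar v. \tag{8.4} $$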

That (8.2) is written as solving an estimation problem is for illustrative purposes
only; really there is a wholly new model $(\bar C, \bar P)$, from which prior / posterior samples
could be generated, model parameters estimated, or likelihood tests performed, in
addition to estimation. Three further comments are in order:

1. The purpose behind such a basis change is to make the problem easier in the
transformed domain. For example, the transformed statistics $F P F^T$ should be
well conditioned and/or sparse.

2. We explicitly need both a forward $F$ and inverse $F^{-1}$ transform. If $F$ alone is
specified, finding the inverse transform may be extremely difficult.

3. The transformation of the measurements implicitly assumes the measurements to
be dense, meaning that the number of measurements and unknowns is equal.

An important special case, which occurs in many image processing contexts (such
as denoising¹) is where every pixel is observed, in which case $C = \bar C = I$.



1 See Figure C.5 in Appendix C.3. 
The key to making this change of basis practical is to select an orthogonal transformation
for $F$, in which case

=> The inverse transformation is easily found: $F^{-1} = F^T$.
=> If the measurement noise $v$ is white, then $\bar v$ is also white.

These properties make orthogonal transformations one of the key tools in signal/image
modelling and analysis. Indeed, later in this chapter we show several extremely
well-known orthogonal transformations: the Fourier transform (Section 8.3),
the Wavelet transform (Section 8.4.2), and the Karhunen-Loeve transform (Section 8.2.1).

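There is, in fact, a well-known orthogonal transformation which greatly simplifies
statistical sampling and estimation. Consider the eigendecomposition $P = V^T \Lambda V$
of covariance $P$:

$$ \begin{array}{ccc}
z \sim P & \xrightarrow{\;F = V\;} & \bar z \sim \bar P = V P V^T = \Lambda \quad (\Lambda \text{ diagonal}) \\
 & & \Big\downarrow \text{Trivial} \\
\hat z & \xleftarrow{\;F^{-1} = V^T\;} & \hat{\bar z}
\end{array} \tag{8.5} $$

That is, by transforming a given model to a very simple (uncorrelated) counterpart,
the complex task of estimation is greatly simplified.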

That is, by transforming a given model to a very simple (uncorrelated) counterpart, 
the complex task of estimation is greatly simplified. 

The problem, of course, is that solving the eigendecomposition to determine this
ideal change of basis is extremely difficult, normally much more difficult than solving
for the estimates $\hat z$ themselves.

The idea, rather, is to select a suboptimal change of basis, without solving an
eigendecomposition, examples of which are discussed in later parts of this chapter.
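For a problem small enough that the eigendecomposition is available by hand, the decoupling can be checked directly. The following worked example is an illustration only: it assumes a two-element state, dense measurements $m = z + v$, and unit white noise $R = I$, so that each transformed element is estimated by an independent scalar gain.

```latex
% Prior covariance, orthogonal eigenvector matrix, and diagonalized covariance
P = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, \qquad
V = \tfrac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad
\Lambda = V P V^T = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}

% Transformed measurements and the decoupled scalar estimators (R = I)
\bar m = V m, \qquad
\hat{\bar z}_i = \frac{\lambda_i}{\lambda_i + 1} \, \bar m_i
\;\Longrightarrow\;
\hat{\bar z}_1 = \tfrac{3}{4} \bar m_1, \quad \hat{\bar z}_2 = \tfrac{1}{2} \bar m_2

% Numerical check with m = (2, 0)^T
\bar m = \begin{bmatrix} \sqrt{2} \\ \sqrt{2} \end{bmatrix}, \qquad
\hat{\bar z} = \begin{bmatrix} 3\sqrt{2}/4 \\ \sqrt{2}/2 \end{bmatrix}, \qquad
\hat z = V^T \hat{\bar z} = \begin{bmatrix} 1.25 \\ 0.25 \end{bmatrix}

% Direct solution in the original domain gives the same answer
\hat z = P (P + I)^{-1} m
       = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}
         \cdot \tfrac{1}{8} \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}
         \begin{bmatrix} 2 \\ 0 \end{bmatrix}
       = \begin{bmatrix} 1.25 \\ 0.25 \end{bmatrix}
```

The transformed problem requires only two scalar divisions, whereas the direct solution requires a matrix inverse; the cost that has been hidden is, of course, that of finding $V$.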



Implicit Basis Change

In those cases where a suitable orthogonal transformation cannot be found, or where
the measurement model is sparse or otherwise irregular, the explicit transformation
of (8.2) may not be appropriate. It is possible, however, to implicitly reformulate an
estimation problem in a transformed domain without the availability of an invertible
transformation. Consider, for example, a standard estimation problem, recast as a
linear system:

$$ \hat z = \bigl( P^{-1} + C^T R^{-1} C \bigr)^{-1} C^T R^{-1} m \quad\Longrightarrow\quad \bigl( P^{-1} + C^T R^{-1} C \bigr) \hat z = C^T R^{-1} m \quad\Longrightarrow\quad A \hat z = b \tag{8.6} $$
The ease with which this linear system can be solved² is a function of the distribution
of the eigenvalues of $A$, known as its spectrum. One quantification of eigenvalue
spread is via the condition number $\kappa(A)$, which follows from our earlier discussion
of such inverse problems in Section 2.3.

If we now suppose a change of basis $\hat z = S \hat{\bar z}$, then we obtain a modified linear
system, analogous to (8.2):

$$ A \hat z = b \quad\longrightarrow\quad S^T A S \, \hat{\bar z} = S^T b \quad\longrightarrow\quad \underbrace{\bar A \, \hat{\bar z} = \bar b}_{\text{Easy?}} \tag{8.7} $$



The key question, then, is how the condition number $\kappa(\bar A)$ of the modified system
compares with $\kappa(A)$ of the original. To be sure, an interest in solving linear systems
goes far beyond statistical image processing; indeed, this process of basis change
for the purpose of improving conditioning has a significant history in the algebra
literature and is known as preconditioning [39, 79, 98, 347].

There are several important aspects to observe regarding this preconditioning approach,
in contrast to the explicit change of (8.2):

1. The change of basis occurs only in the space of the unknowns $z$, not in the
measurement space $m$. Thus no assumptions are made regarding the spatial distribution
or sparsity of the measurements.

2. The preconditioned approach (8.7) never requires the evaluation of the forward
transform $S^{-1}$. All that is required is a transformation $S$ and its transpose $S^T$,
normally expressed implicitly as an algorithm rather than as a matrix.

3. It is not required that $S$ be invertible; indeed, $S$ does not even have to be square.
However, for $\bar A = S^T A S$ to be well conditioned $\bar A$ must be invertible, in which
case $S$ must have full column rank. Therefore $S$ is either square and invertible, for a
change of basis, or rectangular with full column rank, for a reduction of basis.

4. The implicit formulation of (8.7) applies only to estimation, and does not extend
to other aspects of statistical inference and sampling.
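The effect of such a transformation on conditioning can be seen on a $2 \times 2$ system. The sketch below assumes a symmetric positive-definite $A$ with badly scaled diagonal and uses a diagonal $S$ (a simple scaling, chosen here only because its effect is easy to verify by hand); the matrices and helper names are illustrative assumptions.

```python
import math

def cond_sym2(a):
    """Condition number of a symmetric positive-definite 2x2 matrix [[p, q], [q, r]]."""
    (p, q), (_, r) = a
    mean = 0.5 * (p + r)
    radius = math.hypot(0.5 * (p - r), q)    # eigenvalues are mean +/- radius
    return (mean + radius) / (mean - radius)

def congruence(s, a):
    """Return S^T A S for 2x2 matrices stored as nested lists."""
    st_a = [[sum(s[k][i] * a[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    return [[sum(st_a[i][k] * s[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[100.0, 9.0],
     [9.0, 1.0]]
S = [[0.1, 0.0],          # rescales the first unknown by 1/sqrt(A[0][0])
     [0.0, 1.0]]

A_bar = congruence(S, A)  # approximately [[1, 0.9], [0.9, 1]]
print(cond_sym2(A))       # roughly 535
print(cond_sym2(A_bar))   # roughly 19
```

Only $S$ and $S^T$ are applied; the inverse of $S$ is never formed, consistent with the second observation above.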



2 Methods discussed in Chapter 9. 
8.2 Reduction of Basis

It is not obvious that the number of unknowns in the transformed space should equal 
the number in the original space. In particular, if z is known to be very smooth, then 
arguably relatively few coefficients might be sufficient to represent z very well, in 
contrast to a case involving a highly irregular or rough field. 

In the implicit case (8.7) no assumption was made regarding the squareness or in- 
vertibility of S, and the use of a reduced basis is straightforward. Of greater interest 
in reduced-order modelling is the explicit case. 

We start by considering the degree of approximation introduced by a reduction of
basis. Suppose we propose transformations

$$ \bar z = F z \qquad z = S \bar z, \tag{8.8} $$

where $F, S$ are rectangular matrices applied to the random field $z \sim P$, such that the
size of $\bar z$ is deliberately reduced. Then the mean-square error in reduced representation is

$$ \mathrm{MSE}(z - SFz) = E\bigl[ (z - SFz)^T (z - SFz) \bigr] \tag{8.9} $$

$$ = E\bigl[ \mathrm{tr}\bigl( (I - SF) \, z z^T (I - SF)^T \bigr) \bigr] \tag{8.10} $$

$$ = \mathrm{tr}\bigl( (I - SF)^T (I - SF) P \bigr). \tag{8.11} $$

Thus under a regular change of basis (8.2), where $S = F^{-1}$, then $(I - SF) = 0$
and the error in representation is zero, as expected. With a reduction of basis, $F^{-1}$
does not exist, in which case $(I - SF) \neq 0$, normally³ implying that the reduction
of basis introduces some degree of approximation.

Next, it is desired that the original space be able to represent perfectly any signal
from the reduced space. That is, repeated projections of the form

reduced => original => reduced

should not incur additional error:

$$ 0 = \mathrm{MSE}(\bar z - FS\bar z) = E\bigl[ (\bar z - FS\bar z)^T (\bar z - FS\bar z) \bigr] \tag{8.12} $$

$$ = E\bigl[ \mathrm{tr}\bigl( (I - FS) \, \bar z \bar z^T (I - FS)^T \bigr) \bigr] \tag{8.13} $$

$$ = \mathrm{tr}\bigl( (I - FS)^T (I - FS) \bar P \bigr) \tag{8.14} $$

from which it follows that $FS = I$, for which a sufficient condition is that $F$ and $S$
form what is known as a pseudoinverse pair.⁴



3 If the covariance P is singular then it is possible to design a basis reduction that is exact, 
with no approximation. 

4 More specifically, in this case F is referred to as a left-inverse of S; the pseudoinverse 
condition, discussed in Appendix A.9, is slightly more general. 
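Next, we have seen (8.5) the convenience associated with orthogonal transformations.
Suppose that $F$ and $S$ are rectangular portions of orthogonal matrices $\mathcal{F}, \mathcal{S}$:

$$ \mathcal{F} = \begin{bmatrix} F \\ F_o \end{bmatrix} \qquad \mathcal{S} = \begin{bmatrix} S & S_o \end{bmatrix}, \tag{8.15} $$

such that

$$ \underbrace{\mathcal{F} = \mathcal{S}^{-1} = \mathcal{S}^T}_{\text{Orthogonality}} \quad\text{and}\quad \underbrace{FS = I}_{\text{Pseudoinverse}}. \tag{8.16} $$

Then the mean-square error in the reduced representation is

$$ \mathrm{MSE}(z - SFz) = \mathrm{MSE}(\mathcal{S}\mathcal{F}z - SFz) = \mathrm{MSE}(S_o F_o z) \tag{8.17} $$

$$ = E\bigl[ z^T F_o^T S_o^T S_o F_o z \bigr] \tag{8.18} $$

$$ = \mathrm{tr}\bigl( S_o F_o P F_o^T S_o^T \bigr) \tag{8.19} $$

$$ = \mathrm{tr}\bigl( S_o^T S_o F_o P F_o^T \bigr) \tag{8.20} $$

$$ = \mathrm{tr}\bigl( F_o P F_o^T \bigr) \tag{8.21} $$

$$ = \mathrm{MSE}(F_o z) \tag{8.22} $$

which is essentially a restatement of Parseval's theorem [243]: because of the
orthogonality of the transformation, the degree of error in approximating $z$ equals the
mean-square error of the omitted basis elements $F_o z$. Thus a "good" basis reduction
is one in which the variance of $F_o z$ is minimized.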

The following sections begin with the optimum reduction of basis, based on an eigen- 
decomposition, which is useful in dimensionality reduction, but inapplicable more 
generally for reasons of computational complexity. The two remaining sections will 
discuss suboptimal approaches to basis reduction. 

Section   Method                  Basic Assumption
8.2.1     Principal Components    Problem stationarity along some dimension
8.2.2     Fast Pseudoinverses     Spatial smoothness
8.2.3     Local Processing        Local correlations or dense measurements



8.2.1 Principal Components 

The method of Principal Components [91, 165, 166, 179,250] is one of the most fun- 
damental in statistical analysis, variously known as the Karhunen-Loeve transform, 
the Hotelling transform, or empirical orthogonal functions. 

Given a random vector $z \sim P$, we seek an efficient reduced-order representation
$\bar z = Fz$, essentially a compression of $z$. There are two basic perspectives on the
formulation of a criterion for $F$:
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1 . Find the linear transformation which captures the most statistical variability. For 
a single linear function z = £ z, the first principal component of z is the choice 
of £ which maximizes the variance of z: 

Find £ to maximize J(£) = var(z) = £ T P£ such that £ T £ = 1 (8.23) 

This generalizes to finding the first n principal components as the choice of trans- 
formation F which maximizes the overall variability of z 9 measured as the deter- 
minant of its co variance: 

Find F to maximize J(F) = det(cov(l)) = \FPF T \ such that \FF T \ = 1 

(8.24) 

2. Find the linear transformation which minimizes the mean-squared error in the 
reduced-order representation of z. 

Given the reducing transformation z = Fz, the estimator z which minimizes the 
mean- squared error in (z — z) is 

z = F T (FF T )~ 1 z 1 (8.25) 

clearly a function of F. The optimal choice of F is the one minimizing this MSE: 

Find F to minimize J(F) = MSE(| - z). (8.26) 

Both of the above criteria lead to the same conclusion: 

The principal components are given by the eigenvectors of P. That is, the 
optimum linear reducing transformation is found by letting the rows of F 
equal the eigenvectors corresponding to the q largest eigenvalues of P. 

Given the eigendecomposition

$P v_i = \lambda_i v_i, \qquad \lambda_1 > \lambda_2 > \dots > \lambda_n > 0$ (8.27)

where we define the reduction matrix

$V_q = [\, v_1 \ \cdots \ v_q \,]$, (8.28)

then the principal-component representation of $z$ is

$\bar{z} = V_q^T z$, (8.29)

where the mean-square representation error in keeping these first $q$ principal
components is given by the sum of the omitted eigenvalues

$\mathrm{MSE}(z - SFz) = \mathrm{MSE}(z - V_q V_q^T z) = \sum_{j=q+1}^{n} \lambda_j$. (8.30)

Fig. 8.3. Dimensionality reduction of stationary fields: a two-dimensional image $Z$ is
transformed into $q$ decoupled problems $\bar{Z}$.

Fig. 8.4. Dimensionality restoration of an estimated reduced-order model: after estimates have
been computed for the $q$ decoupled problems in $\bar{Z}$ the estimates in the original domain $Z$ are
easily found, inverting the procedure of Figure 8.3.
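
The eigenvector result and the error expression (8.30) are easy to check numerically. The following is a minimal sketch, assuming a small synthetic covariance `P` (built here from a random matrix purely for illustration) and a hypothetical helper name `principal_components`; it is not taken from the text.

```python
import numpy as np

def principal_components(P, q):
    """Return V_q (q leading eigenvectors as columns) and all eigenvalues, largest first."""
    lam, V = np.linalg.eigh(P)            # eigh returns ascending eigenvalues for symmetric P
    lam, V = lam[::-1], V[:, ::-1]        # reorder so that lambda_1 >= lambda_2 >= ...
    return V[:, :q], lam

rng = np.random.default_rng(0)
n, q = 8, 3
A = rng.standard_normal((n, n))
P = A @ A.T                               # hypothetical covariance, for illustration only

Vq, lam = principal_components(P, q)

# The error z - Vq Vq^T z has covariance E P E^T with E = I - Vq Vq^T; its trace is the MSE.
E = np.eye(n) - Vq @ Vq.T
mse = np.trace(E @ P @ E.T)
omitted = lam[q:].sum()                   # right-hand side of (8.30)
reduced_cov = Vq.T @ P @ Vq               # covariance of the reduced vector
```

Here `mse` and `omitted` agree to rounding error, and `reduced_cov` is diagonal, which is the decorrelation property exploited in the remainder of this section.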



If we consider $z$ to be a lexicographically-stacked two- or higher-dimensional random
field, then the size of $z$ precludes applying an eigendecomposition, which
has complexity $O(n^3)$, directly. However, principal components can be applied
to a single dimension in a multidimensional problem. In particular, suppose that
$Z = [\, z_1 \ z_2 \ \dots \ z_n \,]$ is a two-dimensional $n \times n$ random field, as shown in Figure 8.3.
For dimensionality reduction to be applicable, $Z$ and its associated inverse problem
must satisfy two assumptions:

1. The columns of $Z$ must be statistically stationary, such that $P = \mathrm{cov}(z_i)$ is not a
function of $i$, meaning that the principal components will not vary from column
to column.

2. A given column of $Z$ must be either densely measured, or not at all, to allow an
explicit change of basis.

Let $V_q$ be the $q$th-order reduction matrix, the $q$ most significant eigenvectors of $P$;
then

$\bar{Z} = V_q^T Z$ (8.31)

is the reduced $n \times q$ two-dimensional random field, with the key property that the
rows are decorrelated and can be processed separately. Because of the orthogonality
of the eigendecomposition the inversion of the reduction step is very easy:

$\hat{Z} = V_q \hat{\bar{Z}}$, (8.32)

shown in Figure 8.4. Obviously the above development generalizes to reducing a
$d$-dimensional problem to a set of $q$ separate $(d-1)$-dimensional problems, as is
illustrated for three dimensions in Figure 8.5.

Fig. 8.5. Dimensionality reduction in 3D: If the columns of a volume $Z$ are statistically
stationary, we can use principal components to divide the volume into $q$ decoupled 2D slices,
each of which is solved separately, and then recombined to form the estimated volume $\hat{Z}$.

Example 8.1: Principal Components for Remote Sensing



We consider an example in three-dimensional remote sensing [232]. Suppose we
have measurements of ocean temperature, sparsely sampled in location, but
uniformly sampled in depth.

The ocean statistics are highly variable in depth, but approximately stationary in
space, so we can consider taking principal components along the depth axis. To
determine the number of components needed we examine the summed eigenvalues.

[Plot: summed eigenvalues versus Number of Kept Components]

Four to ten components are enough to keep 95% to 98% of the signal energy; we'll
keep five, which leaves us with five sparse, decoupled two-dimensional problems
to solve.

We will see methods for solving this estimation problem in Chapters 9 and 10,
with the resulting estimates plotted in Figure 10.10 on page 346.




The application of principal components is made tractable here in that it is applied
to only a single column of state elements, not over the entire domain. The size and
complexity of the problem are reduced as follows:

            Original Space                         Reduced Space
      # Problems   Size        Complexity    # Problems   Size     Complexity

2D    1            n x n       O(n^6)        q            1 x n    O(q n^3)
3D    1            n x n x n   O(n^9)        q            n x n    O(q n^6)

Although each of the $q$ decoupled problems is still a spatial statistical problem, the
relationship of the $(d-1)$-dimensional spatial models in $\bar{Z}$ to the $d$-dimensional
model in $Z$ is generally unclear. Even if $Z$ were stationary in all dimensions, the $q$
transformed problems would not be governed by a single model: a separate model
needs to be derived for each of the $q$ new decoupled problems. In general, rather
than attempting to transform a model from $Z$ to $\bar{Z}$, it is normally simpler to learn the
statistical models in $\bar{Z}$ directly, for example from simulated or measured data.
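
The reduction (8.31) and its inversion (8.32) amount to two matrix products. The sketch below assumes a hypothetical squared-exponential column covariance `P` and, for simplicity, a field whose columns are independent draws from `P`; both are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 32, 5

# Hypothetical column covariance (squared-exponential in depth), identical for every column.
depth = np.arange(n)
P = np.exp(-((depth[:, None] - depth[None, :]) / 6.0) ** 2) + 1e-6 * np.eye(n)

# Draw a field whose columns all share covariance P (columns independent here for simplicity).
L = np.linalg.cholesky(P)
Z = L @ rng.standard_normal((n, n))

lam, V = np.linalg.eigh(P)                # ascending eigenvalues
Vq = V[:, ::-1][:, :q]                    # q most significant eigenvectors

Zbar = Vq.T @ Z                           # (8.31): each row is one decoupled problem
Zrec = Vq @ Zbar                          # (8.32): back to the original domain

kept_energy = lam[::-1][:q].sum() / lam.sum()
```

The ratio `kept_energy` is the quantity examined in Example 8.1 when choosing how many components to keep.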



8.2.2 Multidimensional Basis Reduction 

The previous section outlined the use of principal components in cases where a
multidimensional problem is stationary, allowing it to be decoupled into a number of
uncorrelated, smaller pieces.

In those cases where the model is spatially nonstationary, or where a decoupling into
separate problems is inconvenient or undesirable, one alternative is not to divide the
problem into multiple pieces, but rather to formulate a reduced-order version of the full
problem. That is, we wish to consider basis reduction: representing a large problem
$z \in \mathbb{R}^n$ in a much smaller, reduced domain $\bar{z} \in \mathbb{R}^q$.

As before, we consider a transformation between an original random field $z$ and its
reduced counterpart $\bar{z}$,

$\bar{z} = Fz \qquad z = S\bar{z}$, (8.33)

where $F$, $S$ are subsampling/reduction and interpolating transforms, respectively. In
principle, $F$ and $S$ may be space-variant, irregular operators. In practice such
generality is unneeded and overcomplicated for large random fields; instead, it is simpler
to think of $\bar{z}$ as representing a low- or coarse-resolution version of $z$, such that $\bar{z}$ lives
on a regular grid or lattice, organized in the same way as the lattice for $z$, but
downsampled by a factor $\gamma$ in each dimension. For large random fields we are unlikely
to specify $F$ or $S$ directly; instead, it is simpler to suppose that the operations are
generated by space-invariant kernels $\mathcal{F}$, $\mathcal{S}$, such that each element in $\bar{z}$ essentially
introduces a weighted copy of $\mathcal{S}$ in the corresponding parts of $z$. We can write this
as

$\bar{Z} = \downarrow(\mathcal{F} * Z) \qquad Z = \mathcal{S} * (\uparrow \bar{Z})$, (8.34)

where $\downarrow$ and $\uparrow$ represent subsampling and zero-padding, respectively. The
interpolating kernel $\mathcal{S}$, illustrated in Figure 8.6, is the more intuitive of the two. Figure 8.7
illustrates a two-dimensional example, in which a smooth random field is represented
by a low-resolution "image", a $\gamma = 6$ reduction to a $5 \times 5$ grid of extracted
coefficients.

Fig. 8.6. The transformation from a coarse representation, left, via zero-padding, middle, and
convolution, right, with a Gaussian kernel $\mathcal{S}$.

A few considerations for the boundaries of the domain of $Z$:

1. Although many mathematical functions have infinitely long tails, such as
Gaussians, in practice the support of $\mathcal{S}$ must be of finite size, so long kernels must be
truncated.

2. It is often efficient to use the FFT (the fast Fourier transform) to compute
convolutions, such as in (8.34). As the FFT actually implements a circular convolution,
additional zero-padding may be needed to prevent a wrapping around the ends of
the lattice, where the extent of zero-padding is proportional to the size of support
of $\mathcal{S}$.

3. In most cases, the interpolating kernel needs to be different near boundaries than
inside the domain. That is, the interpolator is not, in fact, stationary. However,
rather than abandoning the kernel concept and specifying $S$ explicitly, it is
simpler to specify a nonstationary modification of the stationary convolution:

$Z = (\mathcal{S} * (\uparrow \bar{Z})) \oslash (\mathcal{S} * (\uparrow \mathbf{1}))$. (8.35)

Specifically, this ensures that a reduced field of all ones interpolates to a fine-scale
field of all ones. Many other normalizations are possible.
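
The normalized interpolation (8.35) can be sketched in a few lines. The code below assumes a truncated, separable Gaussian kernel with a hypothetical scale of 4 pixels and a coarse lattice centred within each block of $\gamma$ fine pixels; the function names are illustrative. A direct (non-circular) convolution is used, so no FFT wrap-around needs to be handled.

```python
import numpy as np

def upsample_zero_pad(Zbar, gamma, offset):
    """Place coarse values on a fine lattice of zeros (the zero-padding operator)."""
    m0, m1 = Zbar.shape
    Z = np.zeros((m0 * gamma, m1 * gamma))
    Z[offset::gamma, offset::gamma] = Zbar
    return Z

def smooth_separable(Z, g):
    """Non-circular 2D convolution with the separable kernel g g^T, same-size output."""
    Z = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, Z)
    return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, Z)

def interpolate(Zbar, gamma, g):
    """Normalized interpolation in the manner of (8.35)."""
    offset = gamma // 2
    num = smooth_separable(upsample_zero_pad(Zbar, gamma, offset), g)
    den = smooth_separable(upsample_zero_pad(np.ones_like(Zbar), gamma, offset), g)
    return num / den

gamma = 6
x = np.arange(-9, 10)
g = np.exp(-(x / 4.0) ** 2)               # truncated Gaussian, hypothetical scale of 4 pixels

rng = np.random.default_rng(3)
Zbar = rng.standard_normal((5, 5))
Z = interpolate(Zbar, gamma, g)
Z_ones = interpolate(np.ones((5, 5)), gamma, g)
```

The array `Z_ones` is identically one, including at the domain boundaries, which is exactly the property that the normalization is meant to guarantee.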

So what sort of kernel $\mathcal{F}$, $\mathcal{S}$ or associated induced transformation $F$, $S$ makes an
effective change of basis?

Clearly an appropriate choice of basis will have to be a function of the statistics of
the random field; although specialized approaches can be developed for particular
statistics, here we limit our attention to the reduction of smooth random fields. Thus
the interpolating basis elements, the kernel $\mathcal{S}$ or the columns of $S$, should be spatially
smooth. There are three criteria to consider:

Fig. 8.7. We can represent a large random field, left, using a reduced set of coarse-scale
coefficients, right. A Gaussian shape was used here as the interpolating kernel $\mathcal{S}$.



1. Pseudoinverse: Choosing $F$ and $S$ to be a pseudoinverse pair (Appendix A.9)
provides two benefits:

• Choosing $F = S^+$ minimizes the representation mean-squared error,

• The pseudoinverse criterion guarantees that $FS = I$, meaning that there is no
error induced by repeated coarse-fine-coarse projections.

The complexity of pseudoinversion makes the relationship between $\mathcal{F}$ and $\mathcal{S}$
nearly impossible to specify. The analytic form of the Moore-Penrose
pseudoinverse [4] is well known, but computationally intractable for large problems:

$S = F^+ = F^T (F F^T)^{-1} \qquad F = S^+ = (S^T S)^{-1} S^T$. (8.36)

Instead, the pseudoinverse should be represented implicitly. Either $F$ or $S$ may be
specified, and the other inferred. For example, if we assume the interpolator $S$ to
be a given huge, sparse matrix, then we compute

$Q = S^T S$, (8.37)

where the matrix multiplication $S^T S$ is straightforward because of the spatial
sparsity of $S$, and where the size of $Q$ is the size of the reduced state. Then the
pseudoinverse is computed as

$\bar{z} = Fz = S^+ z = Q^{-1} (S^T z)$, (8.38)

such that the matrix product $Q^{-1} S^T$ is never explicitly calculated.

In the event that $Q$ is too large to invert, or $Q^{-1}$ too large to store, $\bar{z}$ can be solved
iteratively [111] from the linear system

$Q \bar{z} = S^T z$. (8.39)

2. Noise Insensitivity: A second desired property is that the transformations be
stable with respect to errors. Consider, for example, the degree to which a
perturbation $\delta$ at the fine scale affects the coarse-scale coefficients:

$\bar{z} \xrightarrow{\;S\;} (z + \delta) \xrightarrow{\;F\;} \bar{z}_\delta$ (8.40)

If $S$ has a tiny singular value, then its pseudoinverse $F$ must have a corresponding
large singular value, implying that a small disturbance $\delta$ could be amplified by $F$
to give rise to arbitrarily large differences in $(\bar{z} - \bar{z}_\delta)$, leading to a normalized
noise sensitivity criterion

$\frac{|\bar{z} - \bar{z}_\delta| \, |z|}{|\delta| \, |\bar{z}|} = \frac{|F\delta| \, |S\bar{z}|}{|\delta| \, |\bar{z}|}$. (8.41)

The upper bound for this sensitivity is given by the product of the largest singular
values $\sigma_{\max}$ of $F$ and $S$,

$\sigma_{\max}(F) \cdot \sigma_{\max}(S) = \mathrm{cond}(F) = \mathrm{cond}(S)$. (8.42)

That is, the noise sensitivity is bounded by the condition number of the subsampler
$F$, equivalently that of the interpolator $S$, implying that the noise sensitivity can
be evaluated from either of $F$, $S$ without computing a pseudoinverse.

In general, an interpolating kernel $\mathcal{S}$ that passes through zero may not at all (or
just barely) sample certain fine-scale elements, making the problem nearly
singular. A kernel robust to noise sensitivity should therefore be strictly positive, or
must be used in a subsampling geometry which guarantees that kernel zeros of
neighbouring coarse-scale elements do not coincide.

3. Shift Invariance: A third desired attribute is that the basis-reduction operation
be shift-invariant, meaning that the particular choice of origin for the coarse scale
(that is, how the coarse lattice is arranged relative to the fine, as in Figure 8.7)
should not lead to spatial variations in the quality or type of representation. In other
words, if I can represent a fine-scale field $Z(x)$, I should equally well be able to
represent its shifted version $Z(x - \delta)$, meaning that translation and representation
commute:

$T\,SFz \approx SF\,Tz$, (8.43)

where $T$ is a spatial translation on the fine scale. There are two motivations for
this assertion:

1. It is undesirable for the coarse grid to manifest itself in any explicit way in the
inversion $S\bar{z}$.

2. In many cases the random field is actually time-dynamic $z(t)$, but where
the dynamics may be slow, involving shifts and motions much smaller than
the coarse discretization interval $\gamma$. The sampling-interpolation operation $SF$
should therefore be insensitive to spatial shifts to ensure that a slow, advective
flow is not progressively corrupted by repeated sampling and interpolation.

Shift insensitivity, (8.43), implies that for any coarse field $\bar{z}$ and fine-scale shift
operation $T$, there exists a new coarse field $\bar{z}_T$ corresponding to the shifted field:

$TS\bar{z} = S\bar{z}_T$ (8.44)

and thus, ignoring boundary effects,

$T(\mathcal{S} * (\uparrow \bar{Z})) = (\mathcal{S} * (\uparrow \bar{Z}_T))$. (8.45)

As convolution simplifies to multiplication in the frequency domain, taking the
Fourier transform of (8.45) leads to

$\mathscr{F}(\mathcal{S})\,\mathscr{F}(\uparrow \bar{Z})\,\mathscr{F}(T) = \mathscr{F}(\mathcal{S})\,\mathscr{F}(\uparrow \bar{Z}_T)$. (8.46)

For this to be satisfiable for all $\bar{z}$, $\mathscr{F}(\mathcal{S})$ must be mostly zero: essentially, the
interpolating kernel $\mathcal{S}$ must be bandlimited, or smooth.

It is important to note that, in general, these latter two criteria are in opposition. That 
is, a highly band-limited kernel is typically poorly conditioned, and thus sensitive to 
noise, whereas a kernel corresponding to a low condition number is typically poorly 
bandlimited, and thus exhibits shift sensitivities. 
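
The implicit pseudoinverse of (8.37)-(8.39) and the condition-number criterion (8.42) can both be illustrated on a small one-dimensional problem. The sketch below assumes a dense interpolator with Gaussian columns (the text assumes a sparse one; dense storage is used here only to keep the example short), and the kernel widths 4 and 9 are arbitrary illustrative values.

```python
import numpy as np

def gaussian_interpolator(n, gamma, xi):
    """Dense n x q interpolator S whose columns are Gaussians centred on the coarse lattice."""
    centres = np.arange(gamma // 2, n, gamma)
    x = np.arange(n)[:, None]
    return np.exp(-((x - centres[None, :]) / xi) ** 2)

def reduce_implicit(S, z):
    """zbar = S^+ z, found by solving Q zbar = S^T z without ever forming S^+."""
    Q = S.T @ S                            # (8.37): size of the reduced state
    return np.linalg.solve(Q, S.T @ z)     # (8.39)

n, gamma = 60, 6
S_narrow = gaussian_interpolator(n, gamma, xi=4.0)
S_wide = gaussian_interpolator(n, gamma, xi=9.0)

rng = np.random.default_rng(2)
zbar = rng.standard_normal(S_narrow.shape[1])
zbar_back = reduce_implicit(S_narrow, S_narrow @ zbar)   # coarse -> fine -> coarse

cond_narrow = np.linalg.cond(S_narrow)    # 2-norm condition number via singular values
cond_wide = np.linalg.cond(S_wide)
```

The round trip returns `zbar` unchanged, as $FS = I$ requires, while the wider (more strongly bandlimited) kernel has the larger condition number, consistent with the opposition between the two criteria described above.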

There is one significant exception to this opposition: if the domain has periodic
boundary conditions and is stationary, then the Fourier transform offers a perfect
pseudoinverse pair

$\bar{z} = \mathrm{FFT}^{-1}(\mathrm{trunc}(\mathrm{FFT}(z)))$ (8.47)

$z = \mathrm{FFT}^{-1}(\mathrm{zeropad}(\mathrm{FFT}(\bar{z})))$ (8.48)

where truncation and zero-padding are frequency domain operations implementing
an ideal low-pass filter. In general the rigid assumptions (stationarity and boundary
periodicity) limit the usefulness of this approach, although we encounter the FFT
again in Section 8.3.
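
A one-dimensional sketch of the pair (8.47), (8.48) follows. It assumes an odd coarse length $q$, which avoids any ambiguity in splitting the Nyquist frequency, and it includes the scale factors $q/n$ and $n/q$ needed because NumPy's inverse FFT normalizes by the transform length; these implementation details are assumptions of the sketch rather than statements from the text.

```python
import numpy as np

def fft_reduce(z, q):
    """(8.47): keep the q lowest frequencies of a periodic signal (q odd)."""
    n, h = len(z), (q - 1) // 2
    X = np.fft.fft(z)
    Xq = np.concatenate([X[: h + 1], X[n - h :]])
    return np.real(np.fft.ifft(Xq)) * (q / n)

def fft_interpolate(zbar, n):
    """(8.48): zero-pad the spectrum of zbar out to n frequencies (len(zbar) odd)."""
    q = len(zbar)
    h = (q - 1) // 2
    Xq = np.fft.fft(zbar)
    X = np.zeros(n, dtype=complex)
    X[: h + 1] = Xq[: h + 1]               # non-negative frequencies
    X[n - h :] = Xq[h + 1 :]               # negative frequencies
    return np.real(np.fft.ifft(X)) * (n / q)

n, q = 64, 15
rng = np.random.default_rng(6)
zbar = rng.standard_normal(q)
zbar_back = fft_reduce(fft_interpolate(zbar, n), q)        # F S zbar

z = rng.standard_normal(n)
lowpass = lambda v: fft_interpolate(fft_reduce(v, q), n)    # the operator S F
shift_then_filter = lowpass(np.roll(z, 1))
filter_then_shift = np.roll(lowpass(z), 1)
```

Both desired properties hold exactly here: the coarse-fine-coarse round trip is the identity, and the operation $SF$ commutes with a one-pixel periodic shift.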

We conclude this section with a brief survey of possible interpolating kernels with
respect to the condition-number and bandlimit criteria. At first glance the kernels of
Figure 8.8 may appear to be superficially similar to the analytical statistical kernels
of Figure 5.16 on page 161; it is crucial to understand a fundamental difference: the
kernels $\mathcal{P}$ of Figure 5.16 represent a covariance, and therefore must satisfy
positive-definiteness, whereas the kernels $\mathcal{S}$ of Figure 8.8 represent a transformation, which
we would like to have well-conditioned and bandlimited, but in fact any
transformation is permissible. That is, one can arbitrarily experiment with different choices of
$\mathcal{S}$, whereas most choices of $\mathcal{P}$ will fail to be positive definite.

As a qualitative evaluation, Table 8.1 lists the shift sensitivity and condition
number for the eight kernels of Figure 8.8. It must be understood that these numbers
are illustrative only, as they can vary greatly with problem size and geometry. Two
observations:

[Figure 8.8 panels: Gaussian, $e^{-(r/\xi)^2}$; Smooth; Nonseparable Exp., $e^{-(r/\xi)}$; Cone-shaped;
Nonseparable Sinc, $\mathrm{sinc}(r/\xi)$; Separable Sinc, $\mathrm{sinc}(x/\xi) \cdot \mathrm{sinc}(y/\xi)$; Negative-lobe;
Thin-Plate (see Figure 5.8)]

Fig. 8.8. Eight plausible interpolation kernels $\mathcal{S}$; in all cases $r = \sqrt{x^2 + y^2}$ measures the
distance to the origin and $\xi$ is a scale parameter that controls the spatial size of the interpolator.



1. The kernel of a perfect low-pass filter is the separable-sinc function, so it is
interesting to see the significant difference in behaviour between it and the FFT.
The differences stem from the boundary periodicity of the FFT and the very slow
decay of the sinc kernels: the finite domain tested in Table 8.1 has non-periodic
boundaries, and for computational reasons the kernels are truncated to a finite
size, particularly problematic for slow-decay kernels.

2. The Gaussian kernel is unique, maximally bandlimited simultaneously in space
and frequency, thus well-approximated as a truncated kernel (space bandlimit)
and giving excellent shift-sensitivity (frequency bandlimit).

A remote-sensing application of multidimensional basis reduction is shown at the
end of this chapter on page 285.

Kernel $\mathcal{S}$                     log Shift Sensitivity   log Condition Number

Gaussian                       0-2                     2-4
Smooth                         3                       1-2
Thin-Plate                     3.3                     1
Nonseparable Exponential       3.5                     1
Negative lobe                  3.5                     1-2.5
Separable Sinc                 3.5                     1
Nonseparable Sinc              3.8                     1-2
Cone-shaped                    3.8                     1

FFT (periodic separable sinc)  $-\infty$

Table 8.1. Shift sensitivity and condition number, evaluated numerically for the eight kernels
of Figure 8.8 on a 10 x 10 two-dimensional domain.



8.2.3 Local Processing 



A final choice of basis reduction is the relatively simple approach of local processing,
in which only a small portion of the overall problem is solved at a time, and the
results stitched together. Clearly the computational savings can be substantial; for
an algorithm with cubic complexity, dividing a problem into $q$ equally sized pieces,
each of size $1/q$ and complexity $1/q^3$, results in a computational complexity $1/q^2$ as
great as solving the full problem directly.

Obviously this sort of approach admits many variations, a few of which are illustrated 
in Figure 8.9. These approaches are, in general, most applicable to estimation, and 
less so to sampling, because measurements will tend to cause the estimates in adja- 
cent blocks to be similar, whereas there is nothing to force any consistency between 
blocks in the stochastic (random) variations in sampling. 

Consider solving an estimation problem by processing local regions separately. We illustrate
the process graphically using the following notation:

[Legend: Measurements Used; Estimates Computed; Estimates Used]

First we can divide the problem into disjoint pieces, stitching together the individual estimates
to obtain the overall result.

The stitched results will tend to be blocky, because there is nothing to force continuity at
the region boundaries. Because a pixel on the edge of a region would benefit from nearby
measurements outside of the region, we can extend the range of measurements used.

In some cases it is inconvenient to use measurements at locations which are not estimated. We
could, instead, estimate over a larger region, but only keep the more local estimates.

Having generated estimates in overlapping regions, why not take advantage of them? We can
synthesize the final result as a tapered interpolation from one region to the next.

Fig. 8.9. Basis reduction by local processing: We can solve a larger problem by dividing it
into separated subproblems. The overlapped approach, bottom, performs no spatial blurring;
any interpolation is between multiple estimates at a single location.

The overlapped approach [169] is a significant special case: because the estimate
of a single state element $z_i$ is best estimated based on measurements around $i$, and
not just at $i$, it is preferable to divide the problem into overlapping blocks, allowing
adjacent blocks to be interpolated, limiting inter-block artifacts and discontinuities.

The overlapped approach is, itself, essentially a projection onto a new basis, this time
a projection into a redundant domain, as some state elements will belong to multiple
overlapped blocks:

$z \xrightarrow{\;F\;} \bar{z} \xrightarrow{\;S\;} z$ (8.49)

such that, as usual, $SF = I$. The size, number, and degree of overlap of the blocks
uniquely define $F$:

$f_{ij} = \begin{cases} 1 & \text{if } \bar{z}_i \text{ corresponds to element } z_j \\ 0 & \text{otherwise} \end{cases}$ (8.50)

$S$ is not uniquely defined, as there are many possible ways to interpolate the
redundant state elements. A spatial linear interpolation from one block to the next [169] is
a simple, intuitive choice. The resulting estimates are computed as

$m \xrightarrow[\text{Divide}]{\;F\;} \bar{m} \xrightarrow[\text{Estimation}]{\text{Local}} \hat{\bar{z}} \xrightarrow[\text{Combine}]{\;S\;} \hat{z}$ (8.51)

It is important to understand that operators $F$ and $S$ do not blur or smooth; they are
pointwise operators. The overlapped results in Example 8.2 give the appearance of
having been blurred, however the smoothness is due to the spatial constraints on $\bar{z}$ in
the overlapped domain; there is no spatial averaging.
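
The divide and combine steps can be sketched in one dimension as follows. This is an illustrative implementation only: the linear taper, the function names, and the use of block means on dense data as the local estimator are all assumptions made for the sketch, loosely following the simple estimator of Example 8.2.

```python
import numpy as np

def block_starts(n, block, overlap):
    """Start indices of overlapping blocks that tile 0..n-1 exactly."""
    step = block - overlap
    assert (n - block) % step == 0, "sizes must tile the domain exactly"
    return np.arange(0, n - block + 1, step)

def taper(block, overlap):
    """Weights ramping linearly over the overlapped ends of a block."""
    w = np.ones(block)
    ramp = np.arange(1, overlap + 1) / (overlap + 1)
    w[:overlap] = ramp
    w[block - overlap:] = ramp[::-1]
    return w

def divide(z, starts, block):
    """F: pointwise copy of state elements into the redundant block domain."""
    return np.stack([z[s:s + block] for s in starts])

def combine(blocks, starts, n, overlap):
    """S: tapered pointwise interpolation between the redundant copies."""
    block = blocks.shape[1]
    w = taper(block, overlap)
    num, den = np.zeros(n), np.zeros(n)
    for b, s in zip(blocks, starts):
        num[s:s + block] += w * b
        den[s:s + block] += w
    return num / den

n, block, overlap = 512, 32, 12
starts = block_starts(n, block, overlap)

rng = np.random.default_rng(4)
z = np.cumsum(rng.standard_normal(n))     # hypothetical 1D signal standing in for an image row

blocks = divide(z, starts, block)
z_back = combine(blocks, starts, n, overlap)              # S F z

local = np.repeat(blocks.mean(axis=1, keepdims=True), block, axis=1)   # block-mean estimator
z_hat = combine(local, starts, n, overlap)
```

Because the weights are normalized at every location, `z_back` equals `z` exactly: the operators copy and reweight values at a single location and never average across space.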



Example 8.2: Overlapped Local Processing

We wish to consider local estimation, as in Figure 8.9. Suppose we have sparse
satellite temperature measurements of the eastern equatorial Pacific ocean. We
will propose a very simple local estimator: we set the estimate of all pixels in a
block equal to the mean of the measurements which lie within it.

[Figure panels: Sparse Data Points; Disjoint-Block Estimates]

As was discussed in Section 8.2.3, and is very clear here, local estimation based on
disjoint blocks suffers from inter-block artifacts (discontinuities). To address this
we can define overlapped blocks, such that each block is processed independently,
but the resulting estimated image is found as a tapered interpolation. The block
size and inter-block overlap must satisfy

Image Size = # Blocks • Block Size - (# Blocks - 1) • Overlap. (8.52)

As all of these quantities are integers, only certain values are permitted, most easily
found by testing (8.52) numerically for a wide range of integer possibilities. For
the 512 x 512 image being processed here we select two cases:

Image Size   # Blocks   Block Size   Overlap (pix)   Overlap (%)

512          25         32           12              37%
512          41         32           20              63%

leading to the following results:

[Figure panels: 37% Overlap; 63% Overlap]
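
The numerical search over (8.52) mentioned in the example is a short loop. The sketch below uses a hypothetical function name and simply rearranges (8.52) as Image Size - Block Size = (# Blocks - 1) • (Block Size - Overlap).

```python
def overlap_options(image_size, block_size):
    """All (n_blocks, overlap) integer pairs satisfying (8.52) for a given block size."""
    options = []
    for overlap in range(block_size):
        step = block_size - overlap        # distance between successive block origins
        if (image_size - block_size) % step == 0:
            n_blocks = (image_size - block_size) // step + 1
            options.append((n_blocks, overlap))
    return options

options = overlap_options(512, 32)
```

For a 512-pixel image and 32-pixel blocks the returned list contains, among others, the two cases selected above: 25 blocks with a 12-pixel overlap and 41 blocks with a 20-pixel overlap.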



8.3 FFT Methods 



A very special change of basis is the Fourier basis for stationary, periodic random 
fields [83,276]. In particular, the Fourier basis elements are the eigenvectors, and 
thus the perfect change of basis, for every stationary, periodic field. Although such 
stationary, periodic fields may be rare in practice, the methods available to solve 
them are so efficient and elegant that some discussion is merited. 

A finite, one-dimensional stationary random process with periodic boundary
conditions is known as circulant, meaning that the points of the process can be visualized
as lying on a circle: no beginning, no end. Similarly in two dimensions, a stationary
random field on a rectangular lattice with periodic boundary conditions is known as
toroidal,^5 as implied in Figure 8.10.

The correlation structure of a $d$-dimensional stationary, periodic $n_1 \times \dots \times n_d$ random
field therefore takes the form

$E\left[ z_{\underline{i}} \, z_{(\underline{i} + \underline{\delta}) \bmod \underline{n}} \right] = E\left[ z_{\underline{0}} \, z_{\underline{\delta} \bmod \underline{n}} \right] = \mathcal{P}_{\underline{\delta}}$ (8.53)

where

$\underline{i} \bmod \underline{n} = [(i_1 \bmod n_1), \dots, (i_d \bmod n_d)]$. (8.54)

^5 That is, topologically, the wrapping of a rectangular sheet onto a torus or doughnut.

Fig. 8.10. Illustration of periodic, stationary random processes in one and two dimensions. In
one dimension the problem maps to a circle, and in two dimensions to a torus, such that the
point pairs (a a), (b b), (c c) all have the same joint statistics.

If the process is lexicographically ordered, the covariance of the resulting random
vector will have a special circulant or block-circulant structure [83], as illustrated
in Figure 8.11. The special significance of any such process is that its covariance is
diagonalized by the $d$-dimensional FFT.



8.3.1 FFT Diagonalization 



Define the Fourier basis element of length $N$ as

$f_N = \left[ e^{-j2\pi 0/N} \ \cdots \ e^{-j2\pi(N-1)/N} \right]^T$. (8.55)

We first consider the diagonalization of a one-dimensional random process $z$ with
circulant covariance

$z \sim P = [\, p_0 \ \cdots \ p_{N-1} \,]$. (8.56)

The Fourier transform, or FFT, applied to $p_0$ is

$\mathrm{FFT}(p_0) = F p_0 = \begin{bmatrix} (f_N^{\,0})^T \\ \vdots \\ (f_N^{\,N-1})^T \end{bmatrix} p_0 = \bar{p}_0$, (8.57)

where the powers of $f_N$ are taken element by element.

[Figure 8.11 panels: One-dimensional circulant covariance; Two-dimensional toroidal covariance]

Fig. 8.11. Examples of circulant and toroidal covariance matrices [83]. In one dimension,
each row or column is equal to the previous row or column, rotated by one position. In two
dimensions, the lexicographical reordering of the 2D process to a one-dimensional vector
implies that the covariance has a block-circulant structure (from the stationarity/periodicity of
the 2D-process rows), where each block A, B, C, D is itself circulant (from the
stationarity/periodicity of the 2D-process columns).



Because $P$ is circulant, $p_i$ is just $p_0$ rotated downwards (Figure 8.11) by $i$ positions,
therefore the standard circular-shift property [244] of the Fourier transform applies:

$F p_i = \bar{p}_0 \odot \left[ e^{-j2\pi 0 \cdot i/N} \ \cdots \ e^{-j2\pi(N-1) i/N} \right]^T = \bar{p}_0 \odot (f_N)^i$. (8.58)

Therefore the Fourier transform applied to the whole covariance can be factored as

$FP = \left[\, \bar{p}_0 \odot (f_N)^0 \ \cdots \ \bar{p}_0 \odot (f_N)^{N-1} \,\right]$ (8.59)

$= \big( \bar{p}_0 \cdot \underbrace{[1 \ \cdots \ 1]}_{N \text{ times}} \big) \odot \underbrace{\left[ (f_N)^0 \ \cdots \ (f_N)^{N-1} \right]}_{F^T = F}$. (8.60)

Therefore the covariance of the transformed random field is

$z \sim P \;\Rightarrow\; \bar{z} = Fz \sim \bar{P} = F P F^H = \big( (\bar{p}_0 \cdot [1 \ \cdots \ 1]) \odot F \big) F^H = \mathrm{Diag}(\bar{p}_0)$. (8.61)

That is, the FFT diagonalizes the covariance associated with any stationary, periodic
random vector. Stated another way, the eigenvectors for all circulant matrices are the
Fourier basis elements, and the FFT of the circulant matrix returns the eigenvalues.
Thus

$z \sim P$ Circulant $\;\Rightarrow\;$ $\mathrm{FFT}(z) \sim$ Diagonal. (8.62)
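
The diagonalization can be verified directly on a small problem. The sketch below assumes a hypothetical exponential first column `p0` and uses the unitary scaling of the DFT matrix, so that $F F^H = I$ holds as in (8.61); NumPy's `fft` itself is unnormalized, which is why the eigenvalues are obtained from `np.fft.fft(p0)` without any further scaling.

```python
import numpy as np

N = 16
k = np.arange(N)
d = np.minimum(k, N - k)                      # circular distance from the origin
p0 = np.exp(-d / 3.0)                         # hypothetical first column of P

# Column i of a circulant matrix is p0 rotated downwards by i positions.
P = np.stack([np.roll(p0, i) for i in range(N)], axis=1)

F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix
Pbar = F @ P @ F.conj().T                     # covariance in the transformed domain
p0bar = np.fft.fft(p0)                        # FFT of the first column: the eigenvalues of P
```

The matrix `Pbar` is diagonal to rounding error, with `p0bar` on its diagonal.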

Given a two-dimensional toroidal random field $Z$, each row or column of $Z$ is
circulant, so an FFT applied to the columns of $Z$

$\bar{Z} = \mathrm{FFT}(Z) = [\, \mathrm{FFT}(z_0) \ \cdots \ \mathrm{FFT}(z_{n-1}) \,]$ (8.63)

has uncorrelated rows. As the FFT is a stationary, periodic operator, each row of $\bar{Z}$
is still circulant, so a second FFT applied to the rows

$\bar{\bar{Z}} = (\mathrm{FFT}(\bar{Z}^T))^T$ (8.64)

decorrelates the elements within rows. Therefore all elements in $\bar{\bar{Z}}$ are decorrelated
from all others, meaning that the associated covariance is diagonalized. As taking
an FFT by columns and then by rows is equivalent to the two-dimensional FFT, we
conclude that

$[Z]_: \sim P$ Toroidal $\;\Rightarrow\;$ $[\mathrm{FFT}_2(Z)]_: \sim$ Diagonal. (8.65)

Finally, by induction we generalize to the $d$-dimensional case. If $Z$ is a random field
of dimensions $n_1 \times \dots \times n_d$, then

$Z$ $d$-Dimensional, Stationary, Periodic $\;\Rightarrow\;$ $[\mathrm{FFT}_d(Z)]_: \sim$ Diagonal. (8.66)



8.3.2 FFT and Spatial Models 

It is clear that a covariance $P$ is an inefficient representation of large,
multidimensional random fields. The inefficiency is particularly striking with stationary, periodic
fields, as the entire covariance $P = [\, p_0 \ p_1 \ \cdots \,]$ can be reconstructed from the first
column $p_0$ alone, since

$[\mathcal{P}]_: = p_0$ (8.67)

as was discussed in Section 5.3.2. We start with the covariance $\bar{P}$ in the transformed
domain, from (8.61):

$\bar{P} = \mathrm{Diag}(\bar{p}_0) = \mathrm{Diag}(F_d \, p_0) = \mathrm{Diag}([\mathrm{FFT}_d(\mathcal{P})]_:)$. (8.68)

Now, consider finding the inverse covariance $P^{-1}$:

$\bar{P} = F_d P F_d^H \;\Rightarrow\; P^{-1} = F_d^H \bar{P}^{-1} F_d$. (8.69)

The matrix inversion is easy, as $\bar{P}$ is diagonal. However, because circulant/toroidal
nonsingular matrices have circulant/toroidal inverses, the explicit construction of
$P^{-1}$ makes no sense; rather we really want $\mathcal{P}^{-1}$, the kernel corresponding to $P^{-1}$:

$p_0 = F_d^H \bar{p}_0 \;\Rightarrow\; \mathcal{P} = \mathrm{FFT}_d^{-1}([\bar{p}_0]_{\underline{n}}) = \mathrm{FFT}_d^{-1}(\bar{\mathcal{P}})$ (8.70)

$\Rightarrow\; \mathcal{P}^{-1} = \mathrm{FFT}_d^{-1}(1 \oslash \bar{\mathcal{P}}) = \mathrm{FFT}_d^{-1}(1 \oslash \mathrm{FFT}_d(\mathcal{P}))$, (8.71)

where

1. Vector $\underline{n} = [n_1 \ n_2 \ \dots \ n_d]$ represents the size of the $d$-dimensional domain being
considered,

2. Matrix inversion has been simplified to reciprocals in the transformed domain,

3. All operations are directly performed on kernels.

Thus we have

$\mathcal{P}^{-1} = \mathrm{FFT}_d^{-1}(1 \oslash \mathrm{FFT}_d(\mathcal{P})) \qquad \mathcal{P} = \mathrm{FFT}_d^{-1}(1 \oslash \mathrm{FFT}_d(\mathcal{P}^{-1}))$. (8.72)

If, from Chapter 5, we recognize $\mathcal{P}^{-1}$ to be a constraint or GMRF model kernel, then
we have derived a significant result: for periodic random fields we have an efficient
means of converting between a model and its associated correlation structure. For
example, this FFT approach was used to compute the correlation length (related to
$\mathcal{P}$) as a function of thin-plate model ($\mathcal{P}^{-1}$) in Figure 5.15 on page 158.

To complete the discussion, note that

$\det(P) = \prod_i \lambda_i(P) = \prod \mathrm{diag}(\bar{P}) = \prod \mathrm{FFT}_d(\mathcal{P})$. (8.73)

If we are calculating log-likelihoods, for example for parameter estimation, then the
log-determinant can be calculated very efficiently as

$\log(\det(P)) = \sum \log(\mathrm{FFT}_d(\mathcal{P}))$. (8.74)
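
Both (8.72) and (8.74) reduce to a few lines with an $n$-dimensional FFT. The sketch below assumes a hypothetical one-dimensional model kernel with its origin at index 0, chosen so that its spectrum is strictly positive, and builds the dense circulant matrix only to check the kernel-domain results.

```python
import numpy as np

def kernel_inverse(kernel):
    """(8.72): kernel of the inverse of a stationary, periodic operator."""
    return np.real(np.fft.ifftn(1.0 / np.fft.fftn(kernel)))

def kernel_logdet(kernel):
    """(8.74): log-determinant of the matrix whose kernel is given."""
    return np.sum(np.log(np.real(np.fft.fftn(kernel))))

N = 32
q = np.zeros(N)
q[0], q[1], q[-1] = 2.1, -1.0, -1.0           # hypothetical model kernel, origin at index 0
p = kernel_inverse(q)                         # associated correlation kernel

# Dense circulant matrix, constructed for checking only.
Q = np.stack([np.roll(q, i) for i in range(N)], axis=1)
dense_first_column = np.linalg.inv(Q)[:, 0]
dense_logdet = np.linalg.slogdet(Q)[1]
```

The same two functions apply unchanged to two- or higher-dimensional kernels, since `fftn` transforms over all axes.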

8.3.3 FFT Sampling and Estimation 

Given a prior model $\mathcal{P}$ or $\mathcal{P}^{-1}$ (the distinction is unimportant, as (8.72) allows an
easy conversion) we wish to generate random samples from the prior model:

$z = P^{1/2} w \qquad w \sim \mathcal{N}(0, I)$ (8.75)

$= F^H \bar{P}^{1/2} F w$ (8.76)

$= F^H \mathrm{Diag}(\bar{p}_0)^{1/2} F w$. (8.77)

Removing the lexicographic ordering on both sides of (8.77) leaves us with a simple,
fast sampler:

$Z = \mathrm{FFT}_d^{-1}\left( \sqrt{\mathrm{FFT}_d(\mathcal{P})} \odot \mathrm{FFT}_d(W) \right)$, (8.78)

where $W$ is a white random field of unit-variance random values, and where $W$ and
$\mathcal{P}$ are the same size as $Z$.
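
A sketch of the sampler (8.78) follows. The helper names are hypothetical, the correlation kernel is a made-up example constructed from a positive spectrum so that it is a valid covariance, and the clipping of the spectrum at zero is a numerical safeguard added here rather than part of (8.78).

```python
import numpy as np

def colour(P_kernel, W):
    """Apply P^{1/2} to the field W using only FFTs of the kernel, as in (8.78)."""
    spectrum = np.real(np.fft.fftn(P_kernel))
    spectrum = np.maximum(spectrum, 0.0)          # clip tiny negative values caused by rounding
    return np.real(np.fft.ifftn(np.sqrt(spectrum) * np.fft.fftn(W)))

def fft_sample(P_kernel, rng):
    """Draw one sample of a stationary, periodic field with correlation kernel P_kernel."""
    return colour(P_kernel, rng.standard_normal(P_kernel.shape))

# Hypothetical 1D correlation kernel, built from a positive spectrum so that it is valid.
N = 16
q = np.zeros(N)
q[0], q[1], q[-1] = 2.1, -1.0, -1.0
p = np.real(np.fft.ifft(1.0 / np.real(np.fft.fft(q))))

# Check on a small problem: the sampler's linear map A satisfies A A^T = P.
A = np.stack([colour(p, e) for e in np.eye(N)], axis=1)
P = np.stack([np.roll(p, i) for i in range(N)], axis=1)

Z = fft_sample(np.outer(p, p), np.random.default_rng(5))   # a 2D sample from a separable kernel
```

Checking $A A^T = P$ on a small problem confirms the covariance of the samples deterministically, without relying on averages over many random draws.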

We hope the pattern has become clear: because all circulant matrices are diagonalized
in the transformed domain, operations of matrix multiplication and inversion become
corresponding scalar operations on the transformed kernels. Thus the equations for
least-squares estimation and posterior sampling follow by inspection. Suppose we
are given a fully stationary estimation problem, meaning that the prior $P$, observation
matrix $C$, and observation noise covariance $R$ are all circulant/toroidal. Let

$\bar{\mathcal{P}} = \mathrm{FFT}_d(\mathcal{P}) \qquad \bar{\mathcal{C}} = \mathrm{FFT}_d(\mathcal{C}) \qquad \bar{\mathcal{R}} = \mathrm{FFT}_d(\mathcal{R})$ (8.79)

then from (3.112), and ignoring prior means, we have

$\hat{Z} = \mathrm{FFT}_d^{-1}\left\{ \left( \bar{\mathcal{P}} \odot \bar{\mathcal{C}}^T \oslash (\bar{\mathcal{C}} \odot \bar{\mathcal{P}} \odot \bar{\mathcal{C}}^T + \bar{\mathcal{R}}) \right) \odot \mathrm{FFT}_d(M) \right\}$ (8.80)

$\tilde{\mathcal{P}} = \mathrm{FFT}_d^{-1}\left\{ \bar{\mathcal{P}} - \bar{\mathcal{P}} \odot \bar{\mathcal{C}}^T \oslash (\bar{\mathcal{C}} \odot \bar{\mathcal{P}} \odot \bar{\mathcal{C}}^T + \bar{\mathcal{R}}) \odot \bar{\mathcal{C}} \odot \bar{\mathcal{P}} \right\}$. (8.81)

The posterior sampler follows from $\tilde{\mathcal{P}}$ in the same manner as (8.78):

$(Z|M) = \hat{Z} + \mathrm{FFT}_d^{-1}\left( \sqrt{\mathrm{FFT}_d(\tilde{\mathcal{P}})} \odot \mathrm{FFT}_d(W) \right)$. (8.82)

Algorithm 3 FFT Estimation and Sampling

Goals: Compute estimates, given measurements M and model C, R, P
Function Ẑ = FFT_Estimate(P, C, R, M)
    d ← ndims(P)                                   Get number of dimensions
    P_f ← FFT_d(P)                                 Diagonalize model
    C_f ← FFT_d(C)
    R_f ← FFT_d(R)
    Ẑ ← real( FFT_d^{-1}( (P_f .* C_f^* ./ (C_f .* P_f .* C_f^* + R_f)) .* FFT_d(M) ) )

Goals: Compute a random sample, size n, from a stationary, periodic prior kernel
Function Z = FFT_Sample(P, n)
    W ← randn(n)                                   Generate white random field
    d ← ndims(P)                                   Get number of dimensions
    if d ≠ length(n) then
        Error('Inconsistent kernel and sample dimensions')
    end if
    Z ← real( FFT_d^{-1}( sqrt(real(FFT_d(P))) .* FFT_d(W) ) )

Goals: Find the matrix inverse of a stationary, periodic kernel
Function A^{-1} = FFT_Inverse(A)
    d ← ndims(A)                                   Get number of dimensions
    A^{-1} ← real( FFT_d^{-1}( 1 ./ real(FFT_d(A)) ) )

Example 8.3: FFT and Spatial Statistics



We use the FFT method to examine the thin-plate prior model of Figure 5.8:

             1
         2  -8   2
     1  -8  20.001  -8   1
         2  -8   2
             1

To use this kernel as the prior to an $N \times N$ image we need to embed the above
values in an $N \times N$ periodic kernel $\mathcal{Q} = \mathcal{P}^{-1}$, with the kernel origin in the upper
left corner, from which the correlation kernel is easily found via (8.72):

[Figure panels: Thin-plate model kernel $\mathcal{P}^{-1}$; Thin-plate correlation kernel $\mathcal{P}$]

With the correlation kernel in place, it is trivial to use (8.78) to generate random
prior samples. Because of the efficiency of the FFT, generating very large random
fields (right) is easy. The left sample shows particularly clearly the periodicity
(top/bottom and left/right) of the field.

[Figure panels: Low-resolution prior sample; High-resolution prior sample]
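
Returning to the estimator (8.80), the following is a minimal Python sketch of the FFT_Estimate function of Algorithm 3, checked against the dense least-squares solution on a small one-dimensional problem. The kernels `p`, `c`, and `r` are hypothetical examples with their origin at index 0, and the transpose of a real circulant operator is represented in the transformed domain by the complex conjugate of its spectrum; the deliberately asymmetric blur `c` makes that conjugate matter in the check.

```python
import numpy as np

def fft_estimate(P_kernel, C_kernel, R_kernel, M):
    """Estimate (8.80) for circulant/toroidal prior, observation, and noise kernels."""
    Pf = np.fft.fftn(P_kernel)
    Cf = np.fft.fftn(C_kernel)
    Rf = np.fft.fftn(R_kernel)
    gain = Pf * np.conj(Cf) / (Cf * Pf * np.conj(Cf) + Rf)
    return np.real(np.fft.ifftn(gain * np.fft.fftn(M)))

def circulant(k):
    """Dense circulant matrix whose first column is k (for checking only)."""
    return np.stack([np.roll(k, i) for i in range(len(k))], axis=1)

N = 16
q = np.zeros(N)
q[0], q[1], q[-1] = 2.1, -1.0, -1.0
p = np.real(np.fft.ifft(1.0 / np.real(np.fft.fft(q))))     # hypothetical prior correlation kernel
c = np.zeros(N)
c[0], c[1], c[-1] = 0.6, 0.3, 0.1                          # hypothetical asymmetric blur
r = np.zeros(N)
r[0] = 0.5                                                 # white noise of variance 0.5

rng = np.random.default_rng(7)
m = rng.standard_normal(N)
z_fft = fft_estimate(p, c, r, m)

P, C, R = circulant(p), circulant(c), circulant(r)
z_dense = P @ C.T @ np.linalg.solve(C @ P @ C.T + R, m)
```

The two estimates agree to rounding error, while the FFT version never forms or inverts an $N \times N$ matrix.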




8.4 Hierarchical Bases and Preconditioners 

The previous sections have examined basis reduction, to reduce problem size, and 
the highly specialized approach of using the FFT for problem diagonalization. 

We now return to the question of whether a general-purpose change of basis can be 
found, appropriate for spatial estimation. We recall from Section 8.1 that for sparsely 
measured domains an explicit change of basis may not be appropriate, rather an 
implicit reformulation of the estimation problem is found by transforming the linear 
system 

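$\hat{z} = (P^{-1} + C^T R^{-1} C)^{-1} C^T R^{-1} m \;\Rightarrow\; (P^{-1} + C^T R^{-1} C)\,\hat{z} = C^T R^{-1} m$

$\Rightarrow\; A z = b$   (8.83)

by a change of basis $z = S\bar{z}$:

$A z = b \;\Rightarrow\; S^T A S \bar{z} = S^T b \;\Rightarrow\; \underbrace{\bar{A}\bar{z} = \bar{b}}_{\text{Easy?}}$   (8.84)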
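A small numerical check may help make the implicit change of basis concrete. The sketch below uses a hypothetical six-element system and an arbitrary invertible S (neither is from the text); it only confirms that solving the transformed system and mapping back gives the solution of the original one, and prints the two condition numbers for comparison.

```python
import numpy as np

n = 6
L = np.diff(np.eye(n), axis=0)            # first-difference operator, (n-1) x n
A = L.T @ L + 0.1 * np.eye(n)             # hypothetical positive-definite system
b = np.arange(1.0, n + 1.0)

S = np.triu(np.ones((n, n)))              # arbitrary invertible change of basis

A_bar = S.T @ A @ S                       # transformed system matrix
b_bar = S.T @ b
z_bar = np.linalg.solve(A_bar, b_bar)     # solve in the transformed space
z = S @ z_bar                             # map back to the original space

print(np.allclose(z, np.linalg.solve(A, b)))
print(np.linalg.cond(A), np.linalg.cond(A_bar))
```

Whether the transformed system is any easier depends entirely on the choice of S, which is the subject of the rest of this section.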



Recall from Figures 8.1 (page 242) and 8.2 that the poor conditioning of A, especially 
in our context of random fields, stems from interaction locality. 

In principal components, a change of basis was inferred from the problem covariance,
but made no assumptions regarding the arrangement of state elements in z.
Now, in contrast, we wish to explicitly recognize that z represents a spatial problem
with the assumption that nearby pixels interact strongly. Can we introduce a change
of basis S in which the basis elements $\bar{z}$ are nonlocal, allowing spatially separated
state elements to be coupled?

Even in the spatial case, the perfect change of basis remains the eigendecomposi- 
tion which will, in general, have highly nonlocal basis vectors. However, we do not 
want to precondition with a set of highly nonlocal vectors, because it is unlikely that 
we will happen to select something that resembles eigenvectors, and indeed highly 
possible that we may create a set of coefficients even more correlated than before. 

That a nonlocal basis $\bar{z}$ is not, on its own, enough to necessarily improve the
conditioning of $\bar{A}$ over $A$ can be seen in Figure 8.12, which illustrates the effect of
five preconditioners on the conditioning of a first- and a second-order system. The local
averaging operator, in which each element in the transformed space is a smeared or
locally-averaged version of the original space, is a nonlocal operator, yet the
conditioning was made worse for the first-order test case.

[Figure panels: $A_1 = (C^T R^{-1} C + \lambda L^T L)$ for First-Order L; $A_2 = (C^T R^{-1} C + \lambda L^T L)$ for Second-Order L; Ideal Preconditioner for $A_1$; Ideal Preconditioner for $A_2$]

The columns (eigenvectors) in the ideal preconditioners are weighted by the eigenvalues to
show significance. For a given change of basis matrix S, we can compute the problem
conditioning as $\kappa_i = \kappa(S^T A_i S)$:

             No Preconditioning   Local Average   Weighted Average   Hier. - Wavelet   Hier. - Triangular
$\kappa_1$   39                   154             26                 8                 5
$\kappa_2$   1485                 939             3                  284               66

Fig. 8.12. Illustration of preconditioning: Each panel plots the sparsity structure of the
corresponding matrix, with circle size related to magnitude, and the filled/unfilled state
related to sign. It is clear that problem conditioning can change greatly as a function of
preconditioning, although the best choice of preconditioner will vary with the problem.
Let us briefly consider the most obvious forms of nonlocal behaviour. As is very
familiar from signal analysis [244], there are two extremes of local / nonlocal
behaviour:

[Plots: Impulse and Sinusoid, each with its frequency-domain (freq) counterpart]

Because we are trying to find basis vectors which are not purely local, but at the same
time not highly nonlocal, we are motivated to consider vectors at a more intermediate
scale, localized in both the time and frequency domains:

[Plots: Sinc and Gaussian, each with its frequency-domain (freq) counterpart]

The latter Gaussian function is particularly interesting for being local in both the
spatial and frequency domains.

It is possible to create bases using shifted versions of all of the above functions, how- 
ever empirically it has been found that effective preconditioners need to introduce a 
range of nonlocality, such that both local and nonlocal effects can be effectively rep- 
resented in the transformed space. Unfortunately, it is not obvious how to take shifted 
versions of the above functions at various scales to make a multiresolution basis. 

Hierarchical systems offer precisely such a range of nonlocality, and two such hier- 
archical bases are discussed in the following sections. 
[Figure panels: Net Interpolant S; $S_0$]

Fig. 8.13. A hierarchical interpolant S can be built up as the product of local interpolations
$S_j$ over scales. Here $S_0$ interpolates the midpoint from the two endpoints, with
progressively more local interpolants in $S_1$ and $S_2$.



8.4.1 Interpolated Hierarchical Bases 



The simplest approach to creating a basis having a range of scales is to develop a 
hierarchical interpolator. Precisely such a method comes out of the preconditioning 
literature [347, 348] and has been used for multidimensional surface estimation [298, 
299]. 

If we wish to specify a hierarchical interpolator, it is cleaner, more intuitive,
and possibly much simpler to construct S hierarchically, rather than as a single
transformation. Therefore our goal is to write the transformation as

$z = S\bar{z} = S_J S_{J-1} \cdots S_1 S_0 \bar{z}$   (8.85)

over J scales, as illustrated in Figure 8.13, such that $S_j$ represents the interpolation
at the jth scale. At each scale, the newly introduced nodes are interpolated
from nodes at the coarser scale, a simple operation which leads to very sparse and
simply structured $S_j$.

The resulting change of basis for a three-scale representation of a one-dimensional
process gives a total of nine basis elements, implied by the columns of S in
Figure 8.13, and plotted graphically in Figure 8.14.

We do not, however, wish to specify S explicitly: for an N × N domain, the matrix
S will be $N^2 \times N^2$, a very large matrix to store, even if sparse. As in the lengthy
discussion of sparse kernels in Section 5.3.2, a similar conclusion applies: we should
specify S implicitly, via an interpolation algorithm

$S_j \bar{z} \equiv \mathrm{Interp}(\bar{z}, j)$   (8.86)

so that the net transformation is computed as

$z = \mathrm{Interp}\big( \mathrm{Interp}( \ldots \mathrm{Interp}(\bar{z}, 0) \ldots, J-1 ), J \big).$   (8.87)
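One possible form of such an interpolation algorithm, for the one-dimensional triangular case, is sketched below. The details are assumptions rather than the text's specification: coefficients are stored at the index of the node they belong to, three scales are indexed 0, 1, 2 to match $S_0$, $S_1$, $S_2$ of Figure 8.13, and the names `interp` and `hier_synthesis` are hypothetical.

```python
import numpy as np

def interp(z, j):
    """Apply one scale S_j: each node that is new at scale j receives, in addition
    to its own coefficient, the average of its two coarser-scale neighbours."""
    z = np.array(z, dtype=float)          # work on a copy
    n = z.size
    stride = (n - 1) >> j                 # spacing of nodes known before scale j
    half = stride // 2                    # new nodes sit midway between them
    for mid in range(half, n, stride):
        z[mid] += 0.5 * (z[mid - half] + z[mid + half])
    return z

def hier_synthesis(z_bar, J):
    """z = S z_bar, computed scale by scale without ever forming S."""
    z = z_bar
    for j in range(J):
        z = interp(z, j)
    return z

J = 3
n = 2**J + 1                              # nine elements, as in Figure 8.13
# Forming S explicitly, for inspection only: one column per unit coefficient
S = np.column_stack([hier_synthesis(e, J) for e in np.eye(n)])
print(S[:, 4])                            # widest triangle, centred on the midpoint
print(S[:, 0])                            # coarsest element attached to one endpoint
```

Each call to `interp` touches every new node once, so the full synthesis costs on the order of n operations, compared with the n-by-n storage of an explicit S.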



[Figure panels: Finest Scale; Mid Scale; Coarse Scale]

Fig. 8.14. The hierarchical interpolator S in Figure 8.13 is composed of nine interpolants
(the columns of S) at different scales. If local interpolation is used in $S_j$, then the
resulting interpolants in S are triangular, as shown.



The implicit, algorithmic formulation of (8.87) facilitates the generalization of this 
approach to methods other than one-dimensional triangular. In particular, the interpo- 
lation may be chosen to be linear or nonlinear (cubic etc.); similarly if z is understood 
to be a multidimensional field, organized lexicographically, then the interpolation can 
just as easily be bilinear or bicubic. 

The only disadvantage of this interpolative method is the need to specify and imple- 
ment it in the first place. Given the widespread availability of implemented wavelet 
transforms, we are highly motivated to consider wavelets, discussed in the following 
section. 



8.4.2 Wavelet Hierarchical Bases 



A multiresolution decomposition [221, 293] is one in which an interpolator

$A_j : V_\infty \rightarrow V_j$   (8.88)

maps from infinite spatial resolution $V_\infty$ to a vector space $V_j$ at resolution j. Under
the conditions that

1. A coarse resolution signal is representable at finer resolutions:

   $V_j \subset V_{j+1} \subset \cdots$   (8.89)

2. The subspaces are stretched by a factor of two:

   $f(t) \in V_j \iff f(2t) \in V_{j+1}.$   (8.90)

3. Shift invariance

then⁶ there exists a unique function $\phi$ such that

$\{2^j \phi(2^j t - n)\}$   (8.91)

is a basis for $V_j$. However this still leaves us with a basis in which all elements are
at one scale, like a set of shifted sinc or Gaussian functions. Furthermore, we cannot
just combine basis elements from different scales j, since that may not leave us with
a basis.

[Figure panels: Haar Wavelet $\psi(t)$; Haar Basis $\{\psi(2^j t - n)\}$]

Fig. 8.15. The Haar wavelet basis: The single Haar wavelet function, left, forms a basis when
shifted and dilated, right.

Instead, the key idea with wavelets is the following: 

Suppose we have a basis for a coarse resolution. What new information do 
we need in order to represent a finer scale? 

Let us seek to define a vector space $W_j$ such that

$V_{j+1} = V_j \oplus W_j, \qquad V_j \perp W_j$   (8.92)

meaning that

$\left.\begin{array}{l} \text{Basis for coarse scale } V_j \\ \text{Basis for missing details } W_j \end{array}\right\}\ \text{Basis for finer scale } V_{j+1}.$   (8.93)

Under the conditions [221] for the multiresolution decomposition, there exists a
unique function $\psi$ such that shifted and dilated versions of $\psi$

$\{2^j \psi(2^j t - n)\}$   (8.94)

form a basis for $W_j$, as is illustrated in Figure 8.15 for the simple case of the Haar
wavelet. By extending the decomposition of (8.92),

$V_{j+1} = V_j \oplus W_j$   (8.95)

$\phantom{V_{j+1}} = (V_{j-1} \oplus W_{j-1}) \oplus W_j$   (8.96)

$\phantom{V_{j+1}} = \underbrace{V_{j-q}}_{\text{Very Coarse}} \oplus \underbrace{W_{j-q}}_{\text{Coarse}} \oplus \cdots \oplus \underbrace{W_j}_{\text{Very Fine}},$   (8.97)

we observe that we have met our goal of expressing a basis for $V_{j+1}$ in terms of basis
elements at all scales. The resulting wavelet basis essentially represents a tradeoff
between the time and frequency representations, as illustrated in Figure 8.16.

6 ... is that $\bigcup_j V_j$ is dense and $\bigcap_j V_j = \{0\}$.

Fig. 8.16. The space (or time) and frequency domains can be carved up in various ways. The
figure shows five possible representations of eight coefficients. The wavelet approaches are
unique, offering high-resolution coefficients in both the temporal and the frequency domains.

The generalization of wavelets to multiple dimensions is straightforward, for both
computational and implementation reasons. The O(n) complexity of the wavelet
transform is lower than the O(n log n) complexity of the fast Fourier transform,
and the wavelet transform is therefore applicable to exceptionally large problems.
Furthermore, for orthogonal wavelets, the multidimensional transform can be computed by
taking one-dimensional transforms along each dimension, as with the Fourier transform.

The basis change for a multidimensional, lexicographically ordered z is thus easily
accomplished by using the corresponding multidimensional wavelet transform [345]

$z = \mathrm{WT}(\bar{z}).$   (8.98)

The orthogonality of the wavelet transform, like the Fourier transform, means that
the inverse and adjoint (transpose) operators are equivalent, such that the product
$\bar{A}\bar{z}$ from (8.7) is easily computed as

$\bar{A}\bar{z} = S^T A S \bar{z}$   (8.99)

$\phantom{\bar{A}\bar{z}} = \mathrm{WT}^{-1}\big( A\, \mathrm{WT}(\bar{z}) \big).$   (8.100)
(8.100) 



8.4.3 Wavelets and Statistics 



The great many choices of wavelets and the computational efficiency and hierarchical 
representation of the wavelet transform have led to extensive study, including some 
work on wavelet statistical properties [237, 317]. 
Example 8.4: Hierarchical Bases

Suppose we consider a one-dimensional, second-order interpolation. We will apply
a simple, iterative solver (Gauss-Seidel, of Section 9.2) to the resulting linear
system. The second-order (thin plate) problem is badly conditioned, such that after
one thousand iterations, almost no progress has been made towards the desired solution:

[Plot: No Preconditioning; horizontal axis: Spatial Element Index]

In contrast, the hierarchical preconditioned problems converge much faster. Because
thin plate implies a rather smooth prior, the smoother bases (triangular, DB4)
converge more rapidly, with DB4 mostly converged in only ten iterations:

[Plots: Hierarchical Triangular; Haar Wavelet (DB1); Daubechies Wavelet DB2; Daubechies Wavelet DB4. Horizontal axis: Spatial Element Index; legend: Meas., Exact Solution]
Given a random field or process $z \sim P$, the transformation of this process by a
wavelet transform leads to the modified statistics

$\bar{z} = \mathrm{WT}(z) = Wz \;\Rightarrow\; \bar{z} \sim \bar{P} = W P W^T.$   (8.101)

In many cases, certainly for many smooth z but also for fractal fields [117], $\bar{P}$ is
assumed to be nearly diagonal, meaning that $\bar{z}$ has been decorrelated. Indeed, that
the wavelet transform nearly whitens many signals is precisely the rationale for its
use in preconditioning and changes of basis.
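The degree of decorrelation can be measured numerically. The sketch below is only illustrative: it assumes a Haar wavelet and a hypothetical exponentially-correlated prior, neither of which is specified in the text, and reports what fraction of the squared entries of the covariance lies off the diagonal before and after the transform.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix W, n a power of two."""
    W = np.array([[1.0]])
    while W.shape[0] < n:
        k = W.shape[0]
        averages = np.kron(W, [1.0, 1.0])            # coarser scales, stretched
        details = np.kron(np.eye(k), [1.0, -1.0])    # finest-scale differences
        W = np.vstack([averages, details]) / np.sqrt(2.0)
    return W

def offdiag_fraction(M):
    """Fraction of the squared entries of M lying off the diagonal."""
    return 1.0 - np.sum(np.diag(M) ** 2) / np.sum(M ** 2)

n = 64
idx = np.arange(n)
P = 0.95 ** np.abs(idx[:, None] - idx[None, :])      # hypothetical smooth prior
W = haar_matrix(n)
P_bar = W @ P @ W.T                                  # statistics of (8.101)

print(offdiag_fraction(P), offdiag_fraction(P_bar))
```

The second number is smaller than the first, but not zero: the transform concentrates the covariance towards the diagonal without removing the correlations altogether.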

Because images are usually densely measured, the near-whitening property of the
wavelet transform has made it popular as an explicit change of basis in image
processing. If a process is densely measured with white additive noise,

$m = Iz + v, \qquad v \sim \sigma I$   (8.102)

then the whole problem can be transformed, using an orthogonal wavelet W, into the
wavelet domain:

$\bar{m} = Wm, \quad \bar{z} = Wz, \quad \bar{v} = Wv \;\Rightarrow\; \bar{m} = I\bar{z} + \bar{v}, \qquad \bar{v} \sim \sigma I$   (8.103)

such that the transformed problem has two particularly convenient properties:

1. The noise $\bar{v}$ remains white, because of the orthogonality of W.

2. The prior model for $\bar{z}$ may be chosen to be diagonal.

Therefore the resulting estimation problem is pointwise, without spatial interactions,
which has led to a wide variety of wavelet methods for image estimation [114, 115].
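A pointwise estimator of this kind might look as follows. This is a sketch under assumptions: a Haar matrix stands in for the orthogonal wavelet W, the diagonal prior variances `p_bar` and the noise variance `noise_var` are hypothetical values, and the prior is taken to be exactly diagonal in the wavelet domain.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix W, n a power of two."""
    W = np.array([[1.0]])
    while W.shape[0] < n:
        k = W.shape[0]
        W = np.vstack([np.kron(W, [1.0, 1.0]),
                       np.kron(np.eye(k), [1.0, -1.0])]) / np.sqrt(2.0)
    return W

rng = np.random.default_rng(0)
n = 16
W = haar_matrix(n)
p_bar = 1.0 / (1.0 + np.arange(n))        # hypothetical diagonal prior variances
noise_var = 0.5                           # hypothetical white-noise variance

# Synthesize a process and its noisy, dense measurements
z = W.T @ (np.sqrt(p_bar) * rng.standard_normal(n))
m = z + np.sqrt(noise_var) * rng.standard_normal(n)

# Transform, estimate each coefficient independently, transform back
m_bar = W @ m
z_bar_hat = p_bar / (p_bar + noise_var) * m_bar
z_hat = W.T @ z_bar_hat
```

Under these assumptions the result coincides with the full spatial least-squares estimate, but only scalar divisions are needed once the transform has been applied.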

The whitening behaviour of most wavelets implies that the wavelet transform of real 
images is sparse, meaning that most of the coefficients are near zero. This observation 
has led to a variety of image-compression methods [302]. Similarly, the sparsity of 
the coefficients has allowed the statistical behaviour of the nonzero coefficients to be 
studied, particularly for basic fundamental image shapes (steps, edges, etc.), which 
has led to methods for wavelet-based image resolution enhancement [262] . 

The marginal statistics for $\bar{z}$ are not, however, Gaussian. For two-dimensional
pictures, the marginal statistics of $\bar{z}$ are more strongly concentrated at zero than a
Gaussian, and the wavelet coefficients have been modelled as Rician, generalized
Gaussian [57], and Gaussian scale mixtures [263] among others. Because the marginal
statistics are no longer Gaussian, the optimal Bayesian estimator is nonlinear, which
has led to a wide variety of nonlinear and thresholding approaches to image
denoising [56, 87], a simple version of which is explored in Problem 8.5.
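As one concrete instance of a thresholding rule, soft thresholding is sketched below. The choice of rule is an assumption made here for illustration; the text does not say which rule it has in mind, and the threshold value is hypothetical.

```python
import numpy as np

def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; those with |c| <= t become zero."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

shrunk = soft_threshold([-3.0, -0.5, 0.2, 2.0], 1.0)
print(shrunk)
```

Applied to wavelet coefficients, small values (mostly noise) are set to zero while large values (mostly signal) are kept with a slightly reduced magnitude, a nonlinear operation in contrast to the linear shrinkage of a Gaussian prior.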

The wavelet transform is not, of course, a perfect whitener. The spatial correlations
have been studied [13, 306], although more success has been had by studying the
parent-child relationship of coefficients [77, 149, 262, 274], which led to models such
as the Hidden Markov Tree, which was illustrated in Figure 7.7.

The hierarchical nature of wavelets also makes them a natural choice for the analysis
and representation of power-law or 1/f processes [339, 340], and they have been
used as the basis for estimating power-law fields [105].



8.5 Basis Changes and Markov Random Fields 

Given the interest throughout this text in Markov random fields and sparse models, 
it is worthwhile asking how such models are affected by a change of basis, and 
particularly by a hierarchical change. 

In general, given a sparse random field Z, such that

$[Z]_: = z \sim P, \qquad P^{-1} \text{ sparse}$   (8.104)

then if this field is operated upon by some change-of-basis operator F, the
transformed field obeys

$\bar{z} = Fz \sim \bar{P}, \qquad \bar{P}^{-1} = (F P F^T)^{-1} = F^{-T} P^{-1} F^{-1},$   (8.105)

where we assume that F is square and invertible. Since it is F which would normally
be specified in a change of basis, and not $F^{-1}$, it is exceptionally unlikely that $F^{-1}$
would in any way be sparse, therefore almost certainly the sparsity of $P^{-1}$ has been
lost in $\bar{P}^{-1}$.

This loss of sparsity is not necessarily tragic, however, since the transformed system 
may not need to be written down explicitly or, in the case of a reduction of basis, 
the transformed system may be sufficiently small that a sparse representation is not 
needed. Recalling the implicit change of basis from (8.84), we had the transformation 
of a linear system 

Az = b, (8.106) 

where A is sparse, to the transformed system

$S^T A S \bar{z} = S^T b \iff \bar{A}\bar{z} = \bar{b},$   (8.107)

where $\bar{A}$ is now dense, as argued in the previous paragraph. However, we do not
necessarily need to store $\bar{A}$; instead, the matrix-vector product $\bar{A}\bar{z}$ is calculated as

$\bar{A}\bar{z} = S^T \big( A ( S\bar{z} ) \big).$   (8.108)

That is, we require only the ability to compute the three matrix-vector products
$S^T x$, $A y$, $S\bar{z}$ to preserve the benefits of the sparsity of A.
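The following sketch illustrates both points on a small problem: the transformed matrix fills in, yet its action on a vector can be computed from three products. The tridiagonal A and the hierarchical triangular S are hypothetical choices for illustration, and ordinary dense arrays are used for simplicity, so "sparse" here refers only to the pattern of zeros.

```python
import numpy as np

def hier_matrix(J):
    """Columns are hierarchical triangular interpolants on 2^J + 1 nodes."""
    n = 2**J + 1
    S = np.eye(n)
    for j in range(J):
        half = ((n - 1) >> j) // 2
        for mid in range(half, n, 2 * half):
            # nodes new at scale j are interpolated from their coarser neighbours
            S[mid, :] += 0.5 * (S[mid - half, :] + S[mid + half, :])
    return S

def count_nonzero(M, tol=1e-12):
    return int(np.count_nonzero(np.abs(M) > tol))

S = hier_matrix(3)
n = S.shape[0]
L = np.diff(np.eye(n), axis=0)            # first-difference operator
A = L.T @ L + np.eye(n)                   # tridiagonal system matrix
A_bar = S.T @ A @ S                       # formed here only for comparison

z_bar = np.arange(n, dtype=float)
implicit = S.T @ (A @ (S @ z_bar))        # three matrix-vector products, (8.108)

print(count_nonzero(A), count_nonzero(A_bar))
print(np.allclose(implicit, A_bar @ z_bar))
```

In a large problem S and its transpose would themselves be applied by an algorithm, such as an interpolation or wavelet routine, rather than stored as matrices.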

The above discussion dealt with changes and reductions of basis in general. However, 
there is a much more interesting, specific question: 
Given a Markov random field Z, if I subsample the random field or represent
it at a reduced resolution, is the resulting field still Markov?



Let us start with a few special cases. If our random field Z is zeroth-order Markov
(white noise), then any orthogonal resolution reduction will result in a zeroth-order
Markov field, since

$\bar{z} = Fz \sim F I F^T = F F^T = I.$   (8.109)

However the absence of any spatial interactions in the zeroth-order case makes it
of limited interest. Certainly if Z is first-order Markov, such that the columns of Z
satisfy

$E_{\mathrm{LLSE}}[z_{i-1} \,|\, z_i, z_{i+1}, z_{i+2}, \ldots] = E_{\mathrm{LLSE}}[z_{i-1} \,|\, z_i]$   (8.110)

then it follows that

$E_{\mathrm{LLSE}}[z_{i-2} \,|\, z_i, z_{i+2}, z_{i+4}, \ldots] = E_{\mathrm{LLSE}}[z_{i-2} \,|\, z_i],$   (8.111)



meaning that if we subsample the columns of Z, the resulting random field is still 
column-wise Markov, although row-wise the Markovianity will (most likely) have 
been lost. 

The loss of Markovianity in downsampling stems from the fact that the exact, fine- 
scale state values are lost, and replaced with functions (e.g., averages) of those state 
values. Consider, for example, a Markov process with four state elements 



$z^T = [\; z_1 \;|\; z_2 \;|\; z_3 \;|\; z_4 \;]$   (8.112)

Suppose the process is first-order Markov, meaning that either $z_2$ or $z_3$ can
conditionally decouple $z_1$ and $z_4$. The process prior inverse could look something like

$P^{-1} = G = \begin{bmatrix} 1.2 & -1 & 0 & 0 \\ -1 & 2.2 & -1 & 0 \\ 0 & -1 & 2.2 & -1 \\ 0 & 0 & -1 & 1.2 \end{bmatrix}$   (8.113)

The zero entries outside the tridiagonal band (shown with a grey background) are those
parts of G which must be zero in order for the prior to be first-order Markov.



Now suppose we modify the process by averaging the middle two elements: 
$(Fz)^T = \left[\; z_1 \;\Big|\; \frac{z_2 + z_3}{2} \;\Big|\; z_4 \;\right]$   (8.114)



where F is a 3 × 4 array. The new middle element contains the information from both
neighbouring values in the middle of the original process; we might naively suppose
that $(z_2 + z_3)/2$ represents a boundary of thickness two, therefore easily conditionally
separating $z_1$ and $z_4$. However, if we examine the prior inverse of $\bar{z}$,

$(F G^{-1} F^T)^{-1} = \begin{bmatrix} 1 & -1 & 0.16 \\ -1 & 2.4 & -1 \\ 0.16 & -1 & 1 \end{bmatrix}$   (8.115)

we find that this process is, indeed, no longer first-order Markov, by the presence of
nonzero values in the greyed entries (the two corner entries). We do notice, however,
that the greyed entries are small. Indeed, Figure 8.17 shows the kernels which arise from
downsampling two-dimensional MRFs; the downsampled fields, although not precisely
Markov, are very nearly so. In many cases it may be very reasonable to approximate a
downsampled MRF as Markov.

Fig. 8.17. The panels show two examples of downsampling two-dimensional MRFs. Random
fields corresponding to both kernels are created in a 2D domain, which is subsampled by
2 × 2 block averaging. Although the subsampled field is no longer Markov, it is clear that
the subsampled field is very nearly Markov, based on the sparsity of the subsampled kernel.
It should come as no surprise that the kernel values have changed, since subsampling affects
the random field variance and correlation structure. Both priors illustrated here are isotropic,
therefore only one quadrant of the kernel is shown, with the kernel origin in the upper-left
corner.
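The four-element averaging example is small enough to check numerically. The sketch below uses the G of (8.113) and the averaging F of (8.114); only the variable names are introduced here.

```python
import numpy as np

# Prior inverse of the four-element, first-order Markov process, from (8.113)
G = np.array([[ 1.2, -1.0,  0.0,  0.0],
              [-1.0,  2.2, -1.0,  0.0],
              [ 0.0, -1.0,  2.2, -1.0],
              [ 0.0,  0.0, -1.0,  1.2]])

# Keep the two end elements, average the middle two, as in (8.114)
F = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

P = np.linalg.inv(G)                      # prior covariance
G_bar = np.linalg.inv(F @ P @ F.T)        # prior inverse of the averaged process

print(np.round(G_bar, 2))
```

The corner entries of `G_bar`, which couple the first and last elements directly, come out at about 0.16: small relative to the other entries, but not zero, so the averaged process is only approximately first-order Markov.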

It is possible to consider the question of constructing exact Gibbs/Markov processes at
multiple resolutions [136], however the results are very complicated, in part because
the downsampled field is not Markov and therefore gives rise to nonlocal cliques
and neighbourhoods. The question of downsampling and subsampling MRFs has
been examined in detail in [199], and also in [146, 197, 199, 254]. There has also
been considerable work in performing image operations, such as segmentation or
denoising, using hierarchies of Markov random fields [35, 58, 146, 186, 234, 334, 341].

[Figure panels: (a) 1D Ising Process; (b) Markov Chain for the Random Walk in k]

Fig. 8.18. In a one-dimensional Ising process (left) the boundary, dashed, between two state
values will move as a random walk (right), until the optimum is reached when k = 0. The
details of the transition probabilities in the corresponding Markov chain will depend on the
details of the sampling algorithm.



8.6 Basis Changes and Discrete-State Fields 



Most discrete-state models, such as those discussed in Section 7.4, are local Markov /
Gibbs, and therefore are subject to the same issues of indirection as local
continuous-state models, as was discussed at the start of this chapter and illustrated in
Figure 8.1 on page 242.

Although the same notion of ill-conditioning does not apply to discrete-state random
fields, the slowness of convergence seen in Figure 8.2 applies equally to the
discrete-state case, as was discussed in the image segmentation illustration of
Application 7 on page 232.

Figure 8.18(a) illustrates the locality principle for the discrete-state Ising model
(Section 7.4.1). Given a one-dimensional binary process, with k elements of value −1
followed by an indefinite sequence of +1, the probability of the Ising prior is clearly
maximized when all of the states have the same value, meaning that the k elements
point up. Under certain assumptions⁷ the only state elements that can change are
those on either side of the dashed boundary, leading the value of k to undergo a
random walk over time, as illustrated in Figure 8.18(b), a well-known Markov chain (the
"Gambler's Ruin" problem [248]). The optimum, at k = 0, is eventually reached, but
the number of iterations is quadratic in k.



7 The sampling of such processes will be discussed in Section 11.3; we are assuming a Gibbs
sampler with a random site-visiting scheme at temperature T = 0.

Fig. 8.19. The downsampling of Ising models: strongly coupled fields become more strongly
coupled, top, and weakly coupled fields more weak, bottom.



Consequently, for local random field models it is exceptionally difficult to synthesize 
structures large in size relative to the local neighbourhood. 

In response to this observation, we are motivated to consider changes of basis for
discrete-valued random fields. The key difficulty is that discrete-state fields do not
allow a tapered basis. That is, essentially all of the continuous-state bases (spatial
functions in Figure 8.8, overlapped methods in Figure 8.9, hierarchical triangles in
Section 8.4.1, wavelets in Section 8.4.2) rely on a large-scale basis element being
spatially tapered. In the continuous-state case, a large-scale element can be a smooth,
low-resolution representation, which is then incrementally nudged and refined towards
finer scales. However, a discrete hierarchy does not allow for a smoothly-varying
representation or for small refinements in state value from scale to scale.

Fundamentally, the issue is that discrete fields are not closed under addition or
multiplication, meaning that one cannot even write $\bar{z} = Fz$, as in (8.1), for discrete z.
That is, a literal change of basis will not be possible.

Fig. 8.20. Downsampling a 2D Ising model exaggerates the difference between the coupling
$\beta$ and the critical value of 0.4407, as was illustrated for two cases in Figure 8.19.

It is, however, possible to construct discrete-state hierarchies based on downsampling.
For example, given an energy function H(z), we can seek to find the energy
$H^k(\downarrow_k z)$ corresponding to the field after k scales of downsampling. It is possible to
derive $H^k$ analytically, per the seminal work by Gidas [136], however in practice it
is common to assume the form of the energy function,

$H^k(\downarrow_k z) = H(\downarrow_k z, \theta_k)$   (8.116)

with the scale dependence only in model parameter $\theta$. The famous example of such
downsampling is the Ising model (Section 7.4.1) [26, 171, 335] which, after
downsampling, remains an Ising model but with a different degree of coupling:

• A weakly coupled field possesses only small structures. When downsampled, the 
structures become even smaller, thus closer to random 

=> Even more weakly coupled 

• A strongly coupled field has most pixels with the same state value, with occasional 
outliers. When downsampled, outliers tend to be removed 

=> Even more strongly coupled 

as illustrated for two examples in Figure 8.19, and plotted as a function of $\beta$ in
Figure 8.20.

The intent of the Ising examples is to illustrate that a hierarchical representation,
even if not a change of basis, is plausible for discrete-state fields. There are two
hierarchies to consider:



1. Top-down, in which the hierarchy begins at a coarse scale, with the coarse-scale
elements repeatedly refined at finer scales.

2. Bottom-up, in which the hierarchy begins at the finest, pixellated scale, and
where some sort of grouping or aggregation leads to coarser representations.

[Figure panels, top row: Label Field; Random Groups; Sampled; Ungrouped; Random Groups (Swendsen-Wang Repeated Random Grouping [297])]
[Figure panels, bottom row: 32265 Groups; 8684 Groups; 4235 Groups; 3406 Groups; 1293 Groups (Fine-to-Coarse Hierarchical Grouping)]

Fig. 8.21. Two approaches to discrete-state grouping. Groups may be repeatedly created and
destroyed, top, or groups may persist, bottom, and be grouped further. The hierarchical
grouping leads to much larger groups and better computational efficiency, but may suffer from
the persistence of a poor grouping.



For discrete-state problems, the bottom-up approach is considerably more common, 
precisely because of the ambiguity in representing a discrete field at a coarse scale. 
Because the field is discrete, however, it will be common to have adjacent elements 
with precisely the same value, leading to a natural notion of grouping, as illustrated 
in Figure 8.21. The Swendsen-Wang method [184, 297] was proposed for the
acceleration of Ising-Potts models, whereby groups of state elements are changed
simultaneously, rather than one pixel at a time. Because the domain is periodically
ungrouped and regrouped, no grouping assignment is ever fixed, meaning that a poor
grouping at some iteration does not compromise the final result.
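One iteration of the group, sample, ungroup cycle can be sketched as follows for a two-dimensional Ising prior. This is a sketch under assumptions not stated in the text: the energy is taken to be $H = -\beta \sum s_i s_j$ over neighbouring pairs with free boundaries, so that equal neighbours are bonded with probability $1 - e^{-2\beta}$; only the prior is sampled (no measurement term); and the function name is hypothetical.

```python
import numpy as np

def swendsen_wang_step(s, beta, rng):
    """One Swendsen-Wang update of a 2D field s with entries +1 / -1."""
    rows, cols = s.shape
    parent = np.arange(rows * cols)            # union-find forest over pixels

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    # Random grouping: bond equal-valued neighbours with probability p_bond
    p_bond = 1.0 - np.exp(-2.0 * beta)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):    # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr >= rows or cc >= cols or s[r, c] != s[rr, cc]:
                    continue
                if rng.random() < p_bond:
                    parent[find(r * cols + c)] = find(rr * cols + cc)

    # Sampling: every group receives a new state; returning the pixel field
    # discards the grouping, so the next call regroups from scratch
    new_state = rng.choice([-1, 1], size=rows * cols)
    roots = np.array([find(i) for i in range(rows * cols)])
    return new_state[roots].reshape(rows, cols)

rng = np.random.default_rng(0)
field = rng.choice([-1, 1], size=(16, 16))
for _ in range(10):
    field = swendsen_wang_step(field, 0.6, rng)
```

Because a whole group changes state at once, large structures can flip in a single iteration, which a pixel-by-pixel sampler would take many iterations to achieve.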

In contrast, methods of region grouping [217, 314, 329] lead to much more rapid 
convergence by growing much larger groups, but with the cost that grouping cannot 
be undone, meaning that grouping errors persist in future iterations. 

Top-down approaches, which dominate in continuous-state problems, can also be
applied in the discrete-state case, as shown in Figure 8.22. In a regular coarse-to-fine
hierarchy [5], the result from a coarser scale becomes the initialization at a finer
scale, leading to the obvious problem that the fine scales are not prevented from
diverging from the coarse-scale initialization.

[Figure panels, top: Coarse-to-Fine Binary Hierarchy [5], from Coarse to Fine]
[Figure panels, bottom: Coarse-to-Fine Ternary Hierarchy, Sampling Only Grey Elements [50], from Coarse to Fine]

Fig. 8.22. Two approaches to discrete-state top-down hierarchies. The result at each scale may
only serve as the initialization of the next, top, or each scale may constrain the behaviour of
the next, bottom.

Although not literally a change of basis, the ternary approach [50] illustrated in the 
bottom panel of Figure 8.22 is close in spirit to an orthogonal representation. The 
binary state is augmented to ternary by adding an undecided condition; all decided 
elements (black or white) are fixed, such that at any scale the large-scale structure 
inherited from coarser scales is unchangeable, and only the undecided elements are 
sampled. The small fraction of undecided pixels at fine scales leads to computational 
benefits similar to those of basis reduction. 



Application 8: Global Data Assimilation [111] 



Let us consider an example, building on the state-reduction methods described in 
Section 8.2.2. 

Suppose we wish to simulate a General Circulation Model, a model of the world's
oceans which is driven by observed data, such as sea-surface temperature. A good
representation of the partial differential equations governing water circulation
requires a very fine resolution, however running a Kalman-like filter and producing
error statistics can be done only at a relatively coarse scale, hence we have a
state-reduction or two-scale problem.

[Map; horizontal axis: Longitude]

Fig. 8.23. Mapping test for global-scale problem, from [111]. The coarse grid is 71 × 62,
superimposed on a 2160 × 960 fine grid. The centred locations of the 3551 interpolants are
shown as white dots.

Specifically, suppose we have the global problem illustrated in Figure 8.23. The
prediction step is performed by some numerical differential equation

$z(t+1|t) = \mathcal{A}(z(t|t)).$   (8.117)

The error statistics are updated using a reduced-order Kalman filter (Section 10.2.6),
which operates on a much smaller state $\bar{z}$,

$\bar{z}(t|t) = \mathcal{K}(\bar{z}(t|t-1)),$   (8.118)

where the number and locations of the reduced states are indicated by the white dots
in Figure 8.23.

The key challenge is therefore the transformation between the fine z and coarse $\bar{z}$
scales, as in (8.8):

$\bar{z} = Fz \qquad z = S\bar{z},$   (8.119)

in particular, such that the transformation does not distort the coarse-fine-coarse
cycle, meaning that we wish the pseudoinverse criterion

$F(S(\bar{z})) = \bar{z}$   (8.120)

to be satisfied, exactly as discussed in Section 8.2. Figure 8.24 shows one
fine-coarse-fine mapping, using a Gaussian kernel from Figure 8.8.
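The pseudoinverse criterion is easy to check on a toy problem. The sketch below is one-dimensional and entirely hypothetical in its sizes, centres and kernel width; it also assumes that F is the exact Moore-Penrose pseudoinverse of S, which may differ from the fast pseudoinverses of Section 8.2.

```python
import numpy as np

n_fine, n_coarse = 64, 8
x = np.arange(n_fine)
centres = np.linspace(4, 60, n_coarse)    # hypothetical interpolant locations
width = 4.0                               # hypothetical Gaussian kernel width

# Columns of S are Gaussian interpolants: z = S z_bar (coarse to fine)
S = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

# F maps fine to coarse: z_bar = F z
F = np.linalg.pinv(S)

z_bar = np.sin(np.arange(n_coarse))       # arbitrary coarse state
z_fine = S @ z_bar                        # coarse -> fine
z_back = F @ z_fine                       # fine -> coarse

print(np.allclose(z_back, z_bar))
```

The coarse-fine-coarse cycle returns the original coarse state, so repeated transformations between the two scales do not accumulate distortion; the reverse, fine-coarse-fine cycle is in general lossy, as the compression in Figure 8.24 suggests.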
[Map; horizontal axis: Longitude]

Fig. 8.24. A fine-coarse-fine mapping from Figure 8.23 [111], meaning that this fine-scale
image was reconstructed from only 3551 coarse-scale values, a compression from the fine
scale by a factor of over 400. Observe the absence of distortions, even along boundaries.



Summary 



The following table gives a quick overview of the methods developed in this chapter 
and the contexts to which they apply. 

Reductions of Basis:

Section   Method                   Basic Assumptions
8.2.1     Principal Components     Problem stationarity along some dimension
8.2.2     Fast Pseudoinverses      Spatial smoothness
8.2.3     Local Processing         Limited correlation, dense measurements

Changes of Basis:

Section   Method                   Basic Assumptions
8.3       FFT                      Full stationarity and periodicity
8.4.1     Hierarchical Triangles   None
8.4.2     Hierarchical Wavelets    None
For Further Study

For aspects of principal components, the text by Jolliffe [179] is recommended.
For interested readers, the generalization of principal components to independent
components analysis is discussed in the elegant text by Hyvarinen, Karhunen, and
Oja [166].

The papers by Szeliski [299] and Yaou and Chang [345] look at the solving of two- 
dimensional inverse problems using triangular and wavelet bases, respectively. 

This chapter has given only the most basic introduction to simple, orthogonal 
wavelets, whereas there are a great number of orthogonal, bi-orthogonal, non- 
orthogonal, complex, complete, and overcomplete wavelet variations from which to 
choose. The text by Strang and Nguyen [293] is an excellent place to start. 



Sample Problems 



Problem 8.1: FFT Sampling 

(a) Use the FFT method to generate a random sample of each of the three kernels 
in Example 6.1. 

Consider the thin-plate kernel in Example 6.1. Set the central element to one of 
20.1, 20.01, 20.001, 20.0001; for each of these four values do the following: 

(b) Use the FFT method to generate a random sample. 

(c) Use the FFT method to find the eigenvalues of the kernel, and from those 
infer the condition number. 

Problem 8.2: FFT Estimation 

Consider the thin-plate kernel in Example 6.1. Set the central element to one of 
20.1, 20.01, 20.001, 20.0001; for each of these four values do the following: 

(a) Use the FFT method to generate a random sample. 

(b) Add unit variance Gaussian noise to each state element. 

(c) Use the FFT method to compute state estimates and estimation error vari- 
ances. 

(d) Comment on how and why the estimation error variance varies with the se- 
lected prior kernel. 
Problem 8.3: Basis Reduction

Suppose we construct an estimation problem similar to the one in Figure 8.2: 
z is one-dimensional, with 200 elements, having a first-order (membrane) prior. 
Because the first-order prior is singular, we will add an additional constraint 

$$ f = [\,1 \;\; 1 \;\; \cdots \;\; 1 \;\; 1\,] \tag{8.121} $$

which penalizes the deviations of z from zero.

(a) Calculate the prior covariance P.

(b) Find the eigenvector associated with the largest eigenvalue of P. 

(c) Find the eigenvector associated with the eigenvalue of P closest to zero. 

(d) Describe briefly the behaviour of a reduced-order system, on the basis of the 
eigenvectors preserved (large eigenvalue) and rejected (small eigenvalue). 

Problem 8.4: Basis Reduction 

Construct a one-dimensional estimation problem, as in Problem 8.3, but for both 
first-order and second-order priors, with two measurements: 

$$ m_1 = z_{50} + v_1 \sim \mathcal{N}(1, 1) \qquad m_2 = z_{150} + v_2 \sim \mathcal{N}(1, 1). \tag{8.122} $$

Calculate the covariances $P_1$, $P_2$ for the first- and second-order problems,
respectively, and find the sorted singular values $\sigma_{j,i}$ of $P_j$:

(a) Plot the singular values $\sigma_{1,i}$ and $\sigma_{2,i}$.

(b) Interpret the singular value plot in terms of the degree to which the basis can 
be reduced. 

(c) Interpret the singular value plot in terms of the condition number of the orig- 
inal problem. 

Problem 8.5: Open-Ended Real-Data Problem — Wavelet-Based Estimation 

Let us apply the wavelet change of basis from (8.103) to image denoising, a 
process known as wavelet shrinkage [56, 87]. Find an image Z (from the Inter- 
net): 

(a) Add Gaussian noise V to form measurements M = Z + V. 

(b) Take the wavelet transform M = WT(M), for example using wavedec in 
MATLAB with wavelet 'db4'. 

(c) Our "estimator" will be a simple, nonlinear threshold: 

$$ \hat{z}_i = \begin{cases} 0 & |m_i| < \zeta \\ m_i & |m_i| > \zeta \end{cases} \tag{8.123} $$

(d) Take the inverse wavelet transform $Z = WT^{-1}(Z)$.

(e) Experiment with different choices for the settings under your control:

(i) The type of noise: Gaussian, salt-and-pepper
(ii) The type of wavelet: DB1, DB2, DB4, ...
(iii) The type of estimator: hard thresholding, soft thresholding, etc.
(iv) The choice of threshold $\zeta$
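As a minimal sketch of the estimator in step (c) and of the choice in (e)(iii), the hard and soft thresholds can be written in a few lines of NumPy. The function names and the test coefficients are illustrative assumptions; the wavelet transform itself is omitted, so the array below merely stands in for a set of wavelet coefficients.

```python
import numpy as np

def hard_threshold(coeffs, zeta):
    """Keep coefficients whose magnitude exceeds zeta; set the rest to zero."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(coeffs) > zeta, coeffs, 0.0)

def soft_threshold(coeffs, zeta):
    """Zero small coefficients and shrink the remaining ones towards zero by zeta."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - zeta, 0.0)

# Stand-in for wavelet coefficients (hypothetical values).
m = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])
print(hard_threshold(m, 1.0))
print(soft_threshold(m, 1.0))
```

Hard thresholding leaves the surviving coefficients untouched, whereas soft thresholding also reduces their magnitude, which tends to give smoother reconstructions at the cost of some bias.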



Part III 



Methods and Algorithms 



Linear Systems Estimation 



One of the most fundamental equations in this book is the solution to the Bayesian
linear least-squares estimator of (3.114):

$$ \hat{z}(m) = \mu + (P^{-1} + C^T R^{-1} C)^{-1} C^T R^{-1} (m - C\mu) \tag{9.1} $$

$$ \tilde{P} = \mathrm{cov}(\tilde{z}) = (P^{-1} + C^T R^{-1} C)^{-1}. \tag{9.2} $$

The presence of matrix inversions clearly limits the size of the problem (the number 
of elements in z) to which these equations can be applied directly. The enormous sig- 
nificance of these equations, however, is that in every least-squares problem, whether 
static or dynamic, possibly in some reduced dimensional space or transformed basis 
(as in Chapter 8), (9.2) needs to be solved. 

Although (9.2) makes explicit reference to Bayesian prior P, we could equally well 
be solving a non-Bayesian least-squares estimation problem 

$$ \hat{z}(m) = (L^T L + C^T R^{-1} C)^{-1} C^T R^{-1} m. \tag{9.3} $$

As the two forms are algebraically equivalent (Section 3.2.4), and as the focus of
this chapter is the algorithmic solution of an algebraic problem, the Bayesian /
non-Bayesian differences are not relevant here, so without loss of generality we
consider the estimator

$$ \hat{z} = (Q + C^T R^{-1} C)^{-1} C^T R^{-1} m \tag{9.4} $$

from which follows the linear system

$$ (Q + C^T R^{-1} C)\, \hat{z} = (C^T R^{-1})\, m, \tag{9.5} $$

a system known as the normal equations. We are thus left to solve a regular linear
system or its preconditioned equivalent (8.7)

$$ A z = b \qquad \text{or} \qquad S^T A S \bar{z} = S^T b. \tag{9.6} $$
As A has the same size as P, it is infeasible to consider a dense storage of A for
large problems, so a sparse (Section 5.3) or implicit representation of A is needed;
in particular, we need efficient ways of computing the matrix-vector product

$$ A z = Q z + C^T R^{-1} C z. \tag{9.7} $$

For example, given a stationary problem with uncorrelated point measurements, then
$C = I$, $R = \mathrm{Diag}(r)$, and Q has a kernel $\mathcal{Q}$, thus

$$ A z = Q z + I^T (\mathrm{Diag}(r))^{-1} I z = \mathcal{Q} * Z + Z \oslash R, \tag{9.8} $$



much faster and requiring vastly less storage than the explicit calculation with full 
matrix A. Even with a more complex problem, say a nonstationary Markov prior 
with correlated point measurements, then A is still sparse, with b bands, where b 
is a function of the Markov order of the prior and the correlation structure in R. 
In this case the product Az is still efficient, although the storage requirements for 
A are b times the storage size of z. In many such nonstationary cases, for example 
the interpolation of Example 5.1, matrix A (or its components Q,C,R) may not be 
stored at all, but rather be represented implicitly by being generated dynamically by 
a computer algorithm: 

Az = a(z). (9.9) 

This is particularly useful for problems in which most state elements are stationary, 
represented by a kernel, with a limited number of exceptions (e.g., boundary condi- 
tions) to be computed separately. Under the assumption that the exceptions are not 
computationally difficult to detect, the computational complexity of computing a(z) 
should be the same as that for the sparse matrix-vector product Az. 
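The implicit product of (9.8) and (9.9) can be sketched as follows. This is only an illustration under stated assumptions: periodic boundaries, C = I, and a hypothetical membrane-like 3 x 3 kernel. The kernel values, the grid size, and the name `apply_A` are not taken from the text.

```python
import numpy as np

def apply_A(Z, kernel, r):
    """Compute A z implicitly: periodic convolution with the prior kernel plus Z / r."""
    half = kernel.shape[0] // 2
    out = Z / r                       # C^T R^{-1} C z with C = I, R = Diag(r)
    for di in range(-half, half + 1):
        for dj in range(-half, half + 1):
            w = kernel[di + half, dj + half]
            if w != 0.0:
                # np.roll shifts the image, so out[i, j] accumulates w * Z[i - di, j - dj]
                out = out + w * np.roll(Z, shift=(di, dj), axis=(0, 1))
    return out

# Hypothetical membrane-like kernel; its elements sum to 0.1.
kernel = np.array([[0.0, -1.0, 0.0],
                   [-1.0, 4.1, -1.0],
                   [0.0, -1.0, 0.0]])
N = 6
rng = np.random.default_rng(0)
Z = rng.standard_normal((N, N))
r = np.full((N, N), 2.0)              # measurement noise variances
Az = apply_A(Z, kernel, r)
print(Az.shape)
```

Only the kernel and the image-sized arrays Z and r are stored, so the storage is proportional to the size of z rather than to the size of A.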

For Bayesian estimation problems it is also possible [298] to solve for $\tilde{P}$ using
linear systems techniques:

$$ \tilde{P} = (P^{-1} + C^T R^{-1} C)^{-1} \;\Longrightarrow\; (P^{-1} + C^T R^{-1} C)\, \tilde{P} = I. \tag{9.10} $$

Thus the columns of $\tilde{P} = [\tilde{p}_0 \;\; \tilde{p}_1 \;\; \cdots]$ can each be solved as

$$ (P^{-1} + C^T R^{-1} C)\, \tilde{p}_i = e_i, \tag{9.11} $$

where $e_i$ is the $i$th unit vector. However, for an n-element random field z there will
be n separate linear systems to solve, thus for all but the very smallest problems
(9.11) is, at best, an exploratory approach, whereby a few individual columns of $\tilde{P}$
could be computed to acquire some understanding of the behaviour of $\tilde{P}$.
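A single column of the error covariance can be obtained from (9.11) with one linear solve. The sketch below assumes a small dense problem so that a library solver can be used directly. The prior, the measurement model, and the function name `posterior_column` are hypothetical choices for illustration.

```python
import numpy as np

def posterior_column(P_inv, C, R_inv, i):
    """Solve (P^{-1} + C^T R^{-1} C) p_i = e_i for column i of the error covariance."""
    A = P_inv + C.T @ R_inv @ C
    e_i = np.zeros(A.shape[0])
    e_i[i] = 1.0
    return np.linalg.solve(A, e_i)

n = 6
L = np.diff(np.eye(n), axis=0)            # first-order differences, (n-1) x n
P_inv = L.T @ L + 0.01 * np.eye(n)        # hypothetical prior inverse, made nonsingular
C = np.zeros((1, n))
C[0, 2] = 1.0                             # one point measurement of element 2
R_inv = np.eye(1)                         # unit-variance measurement noise

column = posterior_column(P_inv, C, R_inv, 2)
print(column)
```

For a large problem the dense solve would be replaced by one of the iterative methods of Section 9.2, applied to the same right-hand side.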

With linear systems solving firmly connected to estimation, the remainder of this 
chapter surveys seven approaches to solving them, divided into direct and iterative 
methods. In general, direct methods have the advantage of being exact (excepting nu- 
merical errors) and having a fairly predictable computational complexity. The great 
attraction of iterative methods is their much simpler implementation and their
compatibility with sparse A or implicit $a(\cdot)$ of (9.9). However, since the iterative
methods gradually approach the solution, questions of accuracy and computational
complexity may be problem-dependent.
9.1 Direct Solution

Our canonical problem, appearing both in static and dynamic estimators, shown in
(9.4), involves a matrix inverse:

$$ \hat{z}(m) = (L^T L + C^T R^{-1} C)^{-1} C^T R^{-1} m \tag{9.12} $$

$$ = A^{-1} b. \tag{9.13} $$

It may be tempting, given such a problem, to compute the explicit matrix inverse
$A^{-1}$; however, this is inadvisable on grounds of numerical accuracy. A number of
iterative approaches to solving (9.13) are discussed in Section 9.2; however, there do
exist non-iterative methods, superior to explicit matrix inversion, for solving such
systems. These are discussed in the remainder of this section.



9.1.1 Gaussian Elimination 

Gaussian Elimination, discussed in Appendix A.7.3, is a direct non-iterative
approach to solving linear systems. We apply repeated row operations to the system
to convert

$$ A z = b \;\xrightarrow{\;\text{Row operations}\;}\; I z = b'. \tag{9.14} $$

Gaussian elimination is impractical for all but the smallest linear systems for the 
key reason that in order to be able to apply row operations to A, we need a full, 
dense, explicit version of A. Furthermore, as Gaussian elimination proceeds, the row 
operations make A increasingly dense, even if initially sparse. Therefore Gaussian 
elimination is either inapplicable, or extremely undesirable, given sparse, kernel, or 
implicit representations of A. 

It is common to use the LU decomposition (Appendix A.7.3), which yields triangular 
matrices L and U, allowing the linear system to be solved by backsubstitution. 



9.1.2 Cholesky Decomposition 

Because the normal equations in A are known to be symmetric, positive-definite, 
the Cholesky decomposition is preferred over the LU decomposition for reasons 
of implementation simplicity, numerical robustness, and computational complexity. 
The Cholesky decomposition, discussed in Appendix A.7.3, takes a given positive- 
definite matrix A and computes its matrix square root 

A = GG T , (9.15) 

where G is lower triangular. The method, shown in Algorithm 4, is attractive because
of its implementation simplicity (the reader may observe the absence of pivoting
issues common in Gaussian elimination), numerical robustness, and a computational
complexity of one half of that of the LU decomposition.

Algorithm 4 The Cholesky Decomposition (many variations exist, see [132, 264])

Goals: Find the matrix square root G of symmetric, positive definite n × n matrix A
Function G = Chol
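One of the many variations of the decomposition can be sketched as follows. This is a column-by-column dense form written for illustration; it is not a transcription of Algorithm 4, and the function name `cholesky_lower` and the test matrix are assumptions.

```python
import numpy as np

def cholesky_lower(A):
    """Return lower-triangular G with A = G G^T for symmetric positive-definite A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    G = np.zeros_like(A)
    for j in range(n):
        # Diagonal element: remove the contribution of the columns already computed.
        d = A[j, j] - G[j, :j] @ G[j, :j]
        if d <= 0.0:
            raise ValueError("matrix is not positive-definite")
        G[j, j] = np.sqrt(d)
        # Elements below the diagonal in column j; note that no pivoting is needed.
        for i in range(j + 1, n):
            G[i, j] = (A[i, j] - G[i, :j] @ G[j, :j]) / G[j, j]
    return G

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
G = cholesky_lower(A)
print(G)
```

In practice a library routine such as `numpy.linalg.cholesky` would normally be preferred; the explicit loops are shown only to make the recursive, sequential dependence of the method visible.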

It is the triangularity of G (in both the Cholesky and LU decompositions) which is
key to the method's efficiency. Given

$$ G G^T z = b, \tag{9.16} $$

using forward substitution (G being lower triangular) we can easily solve for x in

$$ G x = b, \tag{9.17} $$

and then using backsubstitution to solve for z in

$$ G^T z = x. \tag{9.18} $$

For problems of modest size, solving a linear system using a Cholesky decomposi- 
tion followed by backsubstitution is a very credible and numerically robust alterna- 
tive to direct matrix inversion. 
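The two triangular solves of (9.17) and (9.18) can be sketched as below. The helper names and the small test system are illustrative assumptions, and the Cholesky factor itself is taken from NumPy.

```python
import numpy as np

def forward_substitution(G, b):
    """Solve G x = b for lower-triangular G, working from the first row down."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - G[i, :i] @ x[:i]) / G[i, i]
    return x

def back_substitution(U, x):
    """Solve U z = x for upper-triangular U, working from the last row up."""
    n = len(x)
    z = np.zeros(n)
    for i in range(n - 1, -1, -1):
        z[i] = (x[i] - U[i, i + 1:] @ z[i + 1:]) / U[i, i]
    return z

def solve_spd(A, b):
    """Solve A z = b via A = G G^T followed by two triangular solves."""
    G = np.linalg.cholesky(A)
    return back_substitution(G.T, forward_substitution(G, b))

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
b = np.array([2.0, 1.0, 3.0])
z = solve_spd(A, b)
print(z)
```

Each triangular solve costs on the order of n^2 operations for a dense factor, so the decomposition dominates the overall cost.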

9.1.3 Nested Dissection 



For problems of large size the Cholesky approach may remain tractable, however 
subtleties appear. The decomposition algorithm of Algorithm 4 is written in dense 
form, however for most sparse A the decomposed triangular square root G may or 
may not be sparse. The degree to which G is denser than A is referred to as fill-in, 
and may be a function of the ordering of rows and columns in A [132], for example 
whether the state elements in Z should be lexicographically stacked in columns, in 
rows, or in some other ordering. 
Fig. 9.1. Dividing a domain into decoupled parts, with the boundary elements placed at the
end of Z, leads to a reordered A and block-sparse square root G, as shown.



For example, given a small linear system consisting of four elements, the Cholesky 
factor G ends up dense, even though the linear system possessed sparsity: 
(9.19)



In this particular case, if the equations are written in reverse order, the recursive, 
sequential dependence of the Cholesky method preserves sparsity: 
(9.20)
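The effect of ordering on fill-in can be reproduced numerically. The four-element "arrow" matrix below is a hypothetical stand-in chosen for illustration and is not necessarily the system used in the text: one element is coupled to all of the others, and the ordering decides whether that element is eliminated first or last.

```python
import numpy as np

def count_nonzeros(M, tol=1e-12):
    """Number of entries whose magnitude exceeds tol."""
    return int(np.sum(np.abs(M) > tol))

# Hypothetical 4 x 4 system: element 0 is coupled to every other element.
A = np.array([[4.0, 1.0, 1.0, 1.0],
              [1.0, 4.0, 0.0, 0.0],
              [1.0, 0.0, 4.0, 0.0],
              [1.0, 0.0, 0.0, 4.0]])
G = np.linalg.cholesky(A)

# Same equations written in reverse order.
order = [3, 2, 1, 0]
A_rev = A[np.ix_(order, order)]
G_rev = np.linalg.cholesky(A_rev)

print("nonzeros in lower triangle of A:", count_nonzeros(np.tril(A)))
print("nonzeros in G (original order): ", count_nonzeros(G))
print("nonzeros in G (reversed order): ", count_nonzeros(G_rev))
```

With the coupled element first, the factor fills the entire lower triangle (ten nonzeros); with the coupled element last, the factor has the same seven nonzeros as the lower triangle of A.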



For arbitrary nonregular grids or nonstationary MRF kernels, the choice of optimum
element ordering to maximize Cholesky sparsity is very difficult and beyond
the scope of this text [132, 133, 211]. However, for regular multidimensional grids,
Nested Dissection [130, 131] is a method of reordering the elements in a linear
system to preserve, as much as possible, the sparsity of the original.

The method relies very much on the Markov principles of conditional decorrelation 
from Chapter 6. Suppose we can find a boundary which separates the domain into 
decoupled parts. If z is reordered, placing the boundary elements at the end, then
G preserves the block-sparsity of the decoupled parts, as illustrated in Figure 9.1.
The width of the boundary follows immediately from the order of the Markov prior 
present in the estimation problem leading to A. 

Clearly, having inserted one boundary to divide the domain, there is no reason why 
the domain cannot be further subdivided, leading to the recursive sequence illustrated 
in Figure 9.2, with obvious generalizations to higher dimensions. 
Fig. 9.2. For a two-dimensional problem, further sparsity is realized by repeatedly subdividing
along middle columns and rows. The thickness of each boundary portion is a function of the 
size of the kernel in A, equivalently the order of the Markov prior. 



The multiscale approach, discussed in Section 10.3, takes an approach philosoph- 
ically similar to nested dissection, running a Kalman filter along the sequence of 
dissection boundaries. 



9.2 Iterative Solution 

The previous sections considered direct methods to solve a given linear system. How- 
ever, most direct methods suffer from two difficulties: 

1. Implementation complexity, and

2. Computational and storage complexity. 



The remainder of this chapter examines iterative methods [153, 278, 346]. We will
denote by $z(k)$ the solution of the linear system solver after k iterations, and by
$e(k) = z(k) - z$ the error in $z(k)$, relative to the exact solution z. If the linear system
is well-posed, then the exact solution is guaranteed to exist and to be unique.

Related to the error is the residual $r(k) = b - A z(k)$, which measures the degree of
inconsistency of some estimate $z(k)$ with respect to the linear system. We note that

$$ r(k) = b - A z(k) = b - A(e(k) + z) = b - A e(k) - b = -A e(k), \tag{9.21} $$

thus the residual r is a transformed version of the error: e measures the error in the
space of z; r measures the error in the space of b.

Where appropriate, $\delta(k)$ represents the direction in which we move z to try to find a
better solution. That is, in general,

$$ z(k+1) = z(k) + \alpha_k\, \delta(k). \tag{9.22} $$



All of the remaining methods in this chapter are iterative and do not require a dense 
matrix A. What all of the following iterative methods have in common is that the 
only thing they require of A is the ability to compute certain matrix-vector products
involving A. In particular, the great attraction of the Conjugate Gradient method in 
Section 9.2.3 is that it only ever requires the product Az. 



9.2.1 Gauss-Jacobi / Gauss-Seidel

The Gauss-Jacobi and Gauss-Seidel iterations [78, 153, 278] are among the simplest
linear system solvers, iteratively updating the scalar elements in z one at a time. It 
is useful to understand both the scalar and vector forms of the linear system and 
associated iterations: the scalar form is more closely connected to the algorithmic 
implementation, whereas the vector form is more convenient for analysis. 

If we let

$$ A = A_D + A_N \tag{9.23} $$

represent the decomposition of matrix A into diagonal and off-diagonal components,
respectively, then we can rearrange the linear system

$$ \begin{array}{ll}
\text{Matrix-Vector Form} & \text{Scalar-Algorithm Form} \\
A z = b & \sum_j a_{i,j} z_j = b_i \\
A_D z + A_N z = b & \bigl(a_{i,i} z_i + \sum_{j \neq i} a_{i,j} z_j\bigr) = b_i
\end{array} \tag{9.24} $$

to arrive at the Gauss-Jacobi iteration

$$ z(k+1) = A_D^{-1}\bigl(b - A_N z(k)\bigr) \qquad\qquad z_i(k+1) = \frac{b_i - \sum_{j \neq i} a_{i,j} z_j(k)}{a_{i,i}} \tag{9.25} $$

Similarly, if

$$ A = A_L + A_D + A_U \tag{9.26} $$

represents a decomposition into lower-triangular, diagonal, and upper-triangular
pieces, then we can derive the Gauss-Seidel matrix-vector form

$$ \begin{aligned}
A z &= b \\
A_L z + A_D z + A_U z &= b \\
z(k+1) &= (A_L + A_D)^{-1}\bigl(b - A_U z(k)\bigr)
\end{aligned} \tag{9.27} $$

and the scalar-algorithm form:

$$ \begin{aligned}
\sum_j a_{i,j} z_j &= b_i \\
\Bigl(\sum_{j<i} a_{i,j} z_j + a_{i,i} z_i + \sum_{j>i} a_{i,j} z_j\Bigr) &= b_i \\
z_i(k+1) &= \frac{b_i - \sum_{j<i} a_{i,j} z_j(k+1) - \sum_{j>i} a_{i,j} z_j(k)}{a_{i,i}}
\end{aligned} \tag{9.28} $$
Algorithm 5 Gauss-Seidel

Goals: Iteratively solve the linear system $Ax = b$, $x \in R^n$ such that $\|Ax - b\| < \zeta$
Function x = GS(A, b, x(0), $\zeta$)
    $k \leftarrow 0$
    repeat
        $k \leftarrow k + 1$
        for $i \leftarrow 1 : n$ do
            $x_i(k) \leftarrow \bigl(b_i - \sum_{j<i} a_{i,j} x_j(k) - \sum_{j>i} a_{i,j} x_j(k-1)\bigr) / a_{i,i}$
        end for
        $r_k \leftarrow 0$
        for $i \leftarrow 1 : n$ do
            $r_k \leftarrow r_k + \bigl(b_i - \sum_j a_{i,j} x_j(k)\bigr)^2$        Compute Residual
        end for
    until $\sqrt{r_k} < \zeta$



The main difference between the Gauss-Jacobi and Gauss-Seidel iterations is that 
the Gauss-Seidel iteration is in place, meaning that only one copy of z is needed, 
whereas Gauss-Jacobi needs to preserve all of z(k) while computing z(k + 1), ne- 
cessitating two copies. 
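The difference between the two iterations is easiest to see side by side. The following is a minimal dense sketch of (9.25) and (9.28), with a fixed iteration count in place of a residual test; the function names and the small diagonally dominant test system are illustrative assumptions.

```python
import numpy as np

def gauss_jacobi(A, b, z0, iterations):
    """Gauss-Jacobi: every element of z(k+1) is computed from the old vector z(k)."""
    d = np.diag(A)                    # diagonal part A_D, stored as a vector
    A_N = A - np.diag(d)              # off-diagonal part
    z = z0.astype(float).copy()
    for _ in range(iterations):
        z = (b - A_N @ z) / d         # builds a new vector, so two copies exist
    return z

def gauss_seidel(A, b, z0, iterations):
    """Gauss-Seidel: elements are updated in place, using new values as they appear."""
    n = len(b)
    z = z0.astype(float).copy()
    for _ in range(iterations):
        for i in range(n):
            s = A[i, :] @ z - A[i, i] * z[i]   # sum over j != i with the current z
            z[i] = (b[i] - s) / A[i, i]
    return z

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
z_gj = gauss_jacobi(A, b, np.zeros(3), 50)
z_gs = gauss_seidel(A, b, np.zeros(3), 50)
print(z_gj, z_gs)
```

For sparse or kernel-based A the row product `A[i, :] @ z` would be replaced by a sum over the few neighbours of element i.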

The simplicity of these two methods allows a straightforward analysis of their con- 
vergence properties: 

$$ \begin{aligned}
e(k+1) = z(k+1) - z &= A_D^{-1}\bigl(b - A_N z(k)\bigr) - z \\
&= A_D^{-1}(A_D + A_N)\, z - A_D^{-1} A_N z(k) - z \\
&= z + A_D^{-1} A_N z - A_D^{-1} A_N z(k) - z \\
&= -A_D^{-1} A_N \bigl(z(k) - z\bigr).
\end{aligned} \tag{9.29} $$

That is, the Gauss-Jacobi errors obey a dynamic relationship

$$ e_J(k+1) = -A_D^{-1} A_N\, e_J(k) = -A_D^{-1}(A_L + A_U)\, e_J(k). \tag{9.30} $$

Comparing the Gauss-Jacobi and Gauss-Seidel forms in (9.25), (9.27), the
corresponding error dynamics for Gauss-Seidel follow by inspection:

$$ e_S(k+1) = -(A_D + A_L)^{-1} A_U\, e_S(k). \tag{9.31} $$

Thus the iterations converge if and only if all of the eigenvalues of $A_D^{-1}(A_L + A_U)$ or of
$(A_D + A_L)^{-1} A_U$, respectively, have a magnitude less than one (meaning that the
spectral radius $\rho$ is less than one). Because the spectral radius describes the speed of
convergence of the slowest error mode

$$ \lim_{k \to \infty} \left( \frac{\|e_{k+1}\|}{\|e_k\|} \right) = \rho, \tag{9.32} $$

we frequently define a convergence rate

$$ \tau = -\ln(\rho), \tag{9.33} $$

where $1/\tau$ is the number of iterations needed for the error to be reduced by a factor
of $e$, that is, a reduction of about 63%.

Fig. 9.3. Gauss-Jacobi and Gauss-Seidel residuals, as a function of iteration, for the complex
two-dimensional problem of Example 5.1 on page 159.
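For a problem small enough to form the error dynamics of (9.30) and (9.31) explicitly, the spectral radius and the convergence rate of (9.33) can be computed directly. The sketch below assumes a hypothetical tridiagonal test matrix; the function names are illustrative.

```python
import numpy as np

def iteration_matrices(A):
    """Return the Gauss-Jacobi and Gauss-Seidel error-dynamics matrices."""
    A_D = np.diag(np.diag(A))
    A_L = np.tril(A, -1)
    A_U = np.triu(A, 1)
    Q_J = -np.linalg.solve(A_D, A_L + A_U)      # -A_D^{-1} (A_L + A_U)
    Q_S = -np.linalg.solve(A_D + A_L, A_U)      # -(A_D + A_L)^{-1} A_U
    return Q_J, Q_S

def spectral_radius(Q):
    """Largest eigenvalue magnitude of Q."""
    return float(np.max(np.abs(np.linalg.eigvals(Q))))

def convergence_rate(Q):
    """Convergence rate: minus the logarithm of the spectral radius."""
    return -np.log(spectral_radius(Q))

n = 20
A = 2.1 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # hypothetical test matrix
Q_J, Q_S = iteration_matrices(A)
rho_J, rho_S = spectral_radius(Q_J), spectral_radius(Q_S)
print(rho_J, rho_S, convergence_rate(Q_J), convergence_rate(Q_S))
```

For a tridiagonal A such as this one, a classical result is that the Gauss-Seidel spectral radius equals the square of the Gauss-Jacobi one, so the Gauss-Seidel rate is exactly twice as large.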

It can be shown [78, 278] that the Gauss-Jacobi and Gauss-Seidel convergence
condition is guaranteed if A is strictly diagonally dominant:

$$ |a_{i,i}| > \sum_{j \neq i} |a_{i,j}| \quad \text{for all } i. \tag{9.34} $$

The corresponding condition on the kernel $\mathcal{A}$ is that $\mathcal{A}$ is strictly diagonally dominant
if

$$ |\mathcal{A}_{0,0}| > \sum_{(i,j) \neq (0,0)} |\mathcal{A}_{i,j}|. \tag{9.35} $$

Thus the membrane kernel of Figure 5.14 on page 158 is strictly diagonally dominant
for all a > 0, whereas the thin-plate kernel satisfies dominance only if the central 
element is made rather large, a > \[ 7 2A. 

In practice the Gauss-Seidel method is typically more robust, converging for a wider 
variety of problems, and roughly twice as fast as Gauss-Jacobi. 

An example of the application of Gauss-Jacobi and Gauss-Seidel to a complex two- 
dimensional problem is shown in Figure 9.3. Gauss-Jacobi fails to converge, and 
indeed diverges rapidly. Gauss-Seidel converges, however it is clear from the plotted 
surface that the estimates respond locally to the measurements with an exponentially- 
decaying response away from the measurements, precisely along the lines of Fig- 
ures 8.1 and 8.2. 

Both Gauss-Jacobi and Gauss-Seidel are very primitive methods and are, on their 
own, an unrealistic choice for any linear system of significant size, so good perfor- 
mance is in no way expected in Figure 9.3. Modest improvements in computational 
Example 9.1: Iterative Solutions to Interpolation

We can consider solving an estimation problem using Gauss-Jacobi or Gauss-Seidel.
Suppose we have an n-element one-dimensional random process with
first-order regularization constraints, where we measure the 10th, 50th, and 90th
elements with unit-variance noise. Thus

$$ A = C^T R^{-1} C + \lambda L_x^T L_x, \qquad C \in \mathbb{R}^{3 \times n}, \quad R = I_3, \quad L_x \in \mathbb{R}^{(n-1) \times n} \tag{9.36} $$

with each row of $L_x$ a first-order penalty term [0 ... -1 1 ... 0], as
in Section 5.5. Applying Gauss-Jacobi (9.25) (dotted) and Gauss-Seidel (9.27)
(dashed) to this linear system yields

[Plot: estimates versus Spatial Element Index]

with solutions plotted after 1, 10, 100, 1000 iterations, clearly showing a
convergence to the exact solution, plotted as a solid line. The more rapid convergence of
the Gauss-Seidel method is clear. It is also clear that convergence is more rapid
near measurements. For a given set of measurements the convergence rate $\tau$ (9.33)
decreases as the process length increases:

[Plot: convergence rate versus Problem Size (Elements)]

Here $\tau$ was computed from the exact eigendecomposition of the error dynamics
as a function of problem size n.
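The dependence of the convergence rate on problem size can be explored with a short script. The sketch below builds the system of (9.36) under assumptions that are not stated in the example: the regularization weight is set to one, $L_x$ is taken to be the (n-1) x n first-difference matrix, and only three problem sizes are tried.

```python
import numpy as np

def example_system(n, lam=1.0, measured=(10, 50, 90)):
    """Build A = C^T C + lam * L_x^T L_x with R = I (lam = 1 is an assumption)."""
    L_x = np.diff(np.eye(n), axis=0)            # rows of the form [0 ... -1 1 ... 0]
    C = np.zeros((len(measured), n))
    for row, element in enumerate(measured):
        C[row, element - 1] = 1.0               # element numbers are 1-based
    return C.T @ C + lam * L_x.T @ L_x

def rate(A, method):
    """Convergence rate -ln(rho) of the Gauss-Jacobi or Gauss-Seidel error dynamics."""
    A_D = np.diag(np.diag(A))
    A_L = np.tril(A, -1)
    A_U = np.triu(A, 1)
    if method == "jacobi":
        Q = -np.linalg.solve(A_D, A_L + A_U)
    else:
        Q = -np.linalg.solve(A_D + A_L, A_U)
    rho = np.max(np.abs(np.linalg.eigvals(Q)))
    return float(-np.log(rho))

sizes = [100, 130, 160]
rates_gj = [rate(example_system(n), "jacobi") for n in sizes]
rates_gs = [rate(example_system(n), "seidel") for n in sizes]
for n, tj, ts in zip(sizes, rates_gj, rates_gs):
    print(n, tj, ts)
```

Because the measurement locations are fixed, a longer process adds elements that are ever further from any measurement, which is where the slowest error mode resides.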
complexity can be realized by the SOR method (Section 9.2.2). However, for
problems of any significant size, to be effective GJ/GS require some sort of problem
transformation to break the indirection problem of Figure 8.1, seen vividly in Figure 9.3.
A variety of such transformations was discussed in Chapter 8, and preconditioning,
specifically of iterative algorithms, is covered in Section 9.2.4.



9.2.2 Successive Overrelaxation (SOR) 

The principle of overrelaxation [78, 153, 278, 289] is to take an iterative method, most
commonly Gauss-Seidel, and to adjust the change $(z(k+1) - z(k))$ to accelerate
convergence.

Specifically, a given iterative scheme is modified as follows:

$$ \begin{aligned}
\text{Given Iteration:} \quad & z(k+1) = z(k) + \delta(k) \\
\text{Overrelaxed Form:} \quad & z(k+1) = z(k) + \omega\, \delta(k), \qquad 0 < \omega < 2,
\end{aligned} \tag{9.37} $$

multiplying the intended adjustment $\delta(k)$ by some constant $\omega$.

Computing $\delta(k) = (z(k+1) - z(k))$ as the iteration-to-iteration difference, the
overrelaxed forms of the Gauss-Jacobi and Gauss-Seidel iterations can be derived
from (9.25), (9.27) as

$$ \begin{aligned}
\text{GJ:} \quad & z(k+1) = z(k) + \omega \underbrace{\bigl[A_D^{-1}(b - A_N z(k)) - z(k)\bigr]}_{\delta(k)} \\
\text{GS:} \quad & z(k+1) = z(k) + \omega \underbrace{\bigl[(A_L + A_D)^{-1}(b - A_U z(k)) - z(k)\bigr]}_{\delta(k)}.
\end{aligned} \tag{9.38} $$

The idea is to select $\omega$ to reduce the spectral radius of the iteration, thereby
accelerating convergence.

Given iterative error dynamics

$$ e(k+1) = Q\, e(k), \tag{9.39} $$

the effect of $\omega$ on the overrelaxed iteration is easily derived:

$$ \begin{array}{ll}
\text{Basic Iteration} & \text{Overrelaxed} \\
z(k+1) = z(k) + \delta(k) & z^{SOR}(k+1) = z^{SOR}(k) + \omega\, \delta(k) \\
e(k+1) = e(k) + \delta(k) & e^{SOR}(k+1) = e^{SOR}(k) + \omega\, \delta(k) \\
\delta(k) = -(I - Q)\, e(k) & \phantom{e^{SOR}(k+1)} = \bigl(I - \omega(I - Q)\bigr)\, e^{SOR}(k)
\end{array} \tag{9.40} $$

That is, the overrelaxed error dynamics are given by

$$ Q^{SOR}(\omega) = \bigl(I - \omega(I - Q)\bigr). \tag{9.41} $$

[Figure panels, left to right: $\omega < 1$, $\omega = 1$, $\omega > 1$]

Fig. 9.4. The principle of overrelaxation: Parameter $\omega$ controls the degree to which the
eigenvalues of the iteration error dynamics are pulled towards 1.0 (left) or pushed away from 1.0
(right). The goal is to choose $\omega$ to maximize the distance from ±1 to the closest eigenvalues.

The key question, then, is how the spectral radii $\rho(Q)$, $\rho(Q^{SOR}(\omega))$ compare. Given
the eigendecomposition of Q,

$$ Q v_i = \lambda_i v_i, \tag{9.42} $$

it is possible to analytically derive the eigendecomposition of $Q^{SOR}(\omega)$:

$$ Q^{SOR}(\omega)\, v_i = \bigl(I - \omega(I - Q)\bigr) v_i = (1 - \omega + \omega \lambda_i)\, v_i. \tag{9.43} $$

That is, the eigenvectors of the overrelaxed iteration are unchanged, and the
eigenvalues of the error dynamics are modified as

$$ \lambda_i \;\longrightarrow\; \lambda_i^{SOR}(\omega) = 1 - \omega + \omega \lambda_i. \tag{9.44} $$

So values $0 < \omega < 1$ pull eigenvalues towards 1.0, whereas values $1 < \omega < 2$
push eigenvalues away from 1.0, as illustrated in Figure 9.4. This is actually quite
intuitive:

• An eigenvalue less than zero implies an error which oscillates in sign. Therefore
we conclude that the iterative step went too far, and a smaller step ($\omega < 1$) should
push the error eigenvalue closer to zero.

• An eigenvalue close to one implies an error which is decaying only very slowly.
Therefore we conclude that the iterative step was too small, and a larger step
($\omega > 1$) should help to accelerate the method by pushing the eigenvalue towards
zero.

The optimum choice of $\omega$ then corresponds to minimizing the spectral radius:

$$ \omega_{\text{Optimum}} = \arg\min_{\omega}\, \rho\bigl(Q^{SOR}(\omega)\bigr). \tag{9.45} $$
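When the eigenvalues of Q are available, the minimization can be approximated by a simple scan over the overrelaxation parameter, using the eigenvalue mapping 1 - omega + omega * lambda. The sketch below is a brute-force illustration rather than a practical method; the grid of candidate values and the two hypothetical eigenvalue sets are assumptions.

```python
import numpy as np

def sor_spectral_radius(eigs_Q, omega):
    """Spectral radius of the overrelaxed dynamics, via 1 - omega + omega * lambda."""
    return float(np.max(np.abs(1.0 - omega + omega * np.asarray(eigs_Q))))

def best_omega(eigs_Q, omegas=np.linspace(0.05, 1.95, 381)):
    """Scan candidate omega values and return the one with the smallest spectral radius."""
    radii = [sor_spectral_radius(eigs_Q, w) for w in omegas]
    i = int(np.argmin(radii))
    return float(omegas[i]), radii[i]

# Hypothetical eigenvalue sets of the basic iteration.
eigs_one_sided = np.linspace(0.0, 0.9, 10)      # concentrated to one side of zero
eigs_symmetric = np.linspace(-0.9, 0.9, 10)     # symmetric about zero

w1, rho1 = best_omega(eigs_one_sided)
w2, rho2 = best_omega(eigs_symmetric)
print(w1, rho1)
print(w2, rho2)
```

For the one-sided set the scan selects a value well above one and reduces the spectral radius below 0.9, whereas for the symmetric set no value improves on the basic iteration.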

There are two primary limitations to the use of SOR in solving estimation problems: 
Example 9.2: Eigendistributions of Iterative Interpolators

Consider the one-dimensional estimation problem of Example 9.1 with a process
length of n = 100. We consider both first-order ($L_x$) and second-order constraints
($L_{xx}$) in (9.36). Four sets of eigenvalues are shown below, plotted in the complex
plane:

[Figure: four sets of eigenvalues in the complex plane, with a dashed unit circle]

The instability of Gauss-Jacobi applied to second-order constraints can be seen
from the presence of eigenvalues outside of the dashed unit circle.

The eigenvalues of first-order Gauss-Jacobi are symmetric about zero, so SOR
has nothing to offer here. However, the Gauss-Seidel eigenvalues are concentrated
to one side, so a limited benefit can be gained by overrelaxing Gauss-Seidel with
$\omega > 1$, pushing the eigenvalues away from 1.0.



1. A, and therefore Q, are typically huge matrices whose eigendecomposition is
unknown. There are special cases for which the optimum $\omega$ is known exactly;
however, in general finding a suitable $\omega$ may be a matter of trial and error.

2. If the error dynamics have eigenvalues near one (see Example 9.2) then SOR can,
at best, only double the convergence rate $\tau$:

$$ \begin{array}{lcl}
\omega = 1: & & \omega = 2: \\
\lambda = 1 - \epsilon & \longrightarrow & \lambda^{SOR} = 1 - 2 + 2\lambda = 1 - 2\epsilon \\
\tau = -\ln(1 - \epsilon) \simeq \epsilon & & \tau^{SOR} \simeq 2\epsilon
\end{array} \tag{9.46} $$

If there are eigenvalues present with values less than zero, then to prevent
instability $\omega$ must be chosen somewhat less than two, with a corresponding reduction
in convergence rate.

In the limiting case where multiple eigenvalues are distributed symmetrically
about zero, then $\omega_{\text{Optimum}} = 1$, implying that SOR offers no benefit.



9.2.3 Conjugate Gradient and Krylov Methods 

Conjugate Gradient (CG) is probably the most popular general-purpose method for 
iterative linear systems solving, finding a nice balance between the trivial implemen- 
tation and moderate performance of Gauss-Jacobi and Gauss-Seidel, and far more 
complex methods such as multigrid (Section 9.2.5). 

CG is discussed in an enormous number of books and papers [34, 43, 141, 142, 153,
268]; however, the tutorial in [285] is highly recommended for anyone seeking to
understand CG at a deeper, intuitive level.

CG is essentially a method of steepest or gradient descent — sliding down the gradi- 
ent of some objective function, seeking a minimum. Given the linear system Az = b, 
where A is symmetric and positive-definite, we define the objective function 

$$ f(z) = \tfrac{1}{2} z^T A z - b^T z \tag{9.47} $$

from which it follows that

$$ f'(z) = \frac{\partial f}{\partial z} = A z - b \qquad f''(z) = \frac{\partial^2 f}{\partial z^2} = A > 0. \tag{9.48} $$

We see that the first derivative contains our linear system of interest, so our goal is to
find z such that $f'(z) = 0$, meaning that we have found a minimum of $f$.

The method of steepest descent seeks to move down the gradient of $f$ to find a
minimum. This gradient is found particularly easily:

$$ f'(z(k)) = A z(k) - b = -r(k). \tag{9.49} $$

That is, the residual points in the direction of steepest gradient descent, and the
solution is iterated as

$$ z(k+1) = z(k) + \alpha\, r(k). \tag{9.50} $$
Fig. 9.5. An example of steepest descent. The contours show the values of f(z), with twenty
iterations of steepest descent superimposed. We observe how successive steps move at right 
angles to each other; it is also clear that many, many steps are taken in the same direction until 
the desired minimum is reached. 



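The distance $\alpha$ that we move along $r(k)$ is selected to minimize $f(z(k) + \alpha r(k))$,
thus setting the partial derivative to zero:

$$ \begin{aligned}
\frac{\partial f(z(k+1))}{\partial \alpha} = f'(z(k+1))^T \frac{\partial z(k+1)}{\partial \alpha} &= 0 \\
\Longrightarrow \quad f'(z(k+1))^T r(k) &= 0 \\
\Longrightarrow \quad -r(k+1)^T r(k) &= 0.
\end{aligned} \tag{9.51} $$

Substituting $r(k+1) = b - A(z(k) + \alpha r(k))$ allows us to solve for $\alpha$:

$$ \alpha = \frac{r(k)^T r(k)}{r(k)^T A\, r(k)}. \tag{9.52} $$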



The problem with steepest descent, especially if A is poorly conditioned, is that 
we may end up converging very slowly, requiring many, many iterations to reach a 
reasonable answer. 
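The steepest-descent iteration of (9.50) with the step length of (9.52) takes only a few lines. The sketch below uses a hypothetical 2 x 2 symmetric positive-definite system and a fixed number of iterations; the function name is illustrative.

```python
import numpy as np

def steepest_descent(A, b, z0, iterations):
    """Move along the residual, with the step length chosen to minimize f on that line."""
    z = z0.astype(float).copy()
    for _ in range(iterations):
        r = b - A @ z                     # residual, the direction of steepest descent
        rr = r @ r
        if rr == 0.0:                     # already at the exact solution
            break
        alpha = rr / (r @ (A @ r))        # step length along the residual
        z = z + alpha * r
    return z

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])
z = steepest_descent(A, b, np.zeros(2), 50)
print(z)

# Successive residuals are orthogonal.
z1 = steepest_descent(A, b, np.zeros(2), 1)
r0, r1 = b - A @ np.zeros(2), b - A @ z1
print(r0 @ r1)
```

This small system is well conditioned, so the iteration settles quickly; the slow convergence described above appears when the eigenvalues of A are widely spread.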

The problem, as is made clear in Figure 9.5, is that steepest descent may end up
moving in the same direction many times. Ideally we would go the "right" distance
along some direction $r(k)$, not such that we end up at a minimum of $f$ (what we did
before), which led to the orthogonality

$$ r(k+1)^T r(k) = 0, \tag{9.53} $$

but rather that we move as far in the current direction as we need to, implying that
the remaining error be orthogonal to the current direction:

$$ e(k+1)^T r(k) = 0. \tag{9.54} $$

Unfortunately we can't do this, since we would need to know the exact solution z in 
order to know e. 

The beauty of conjugate gradient is that it cleverly selects a special set of directions
in which to move, which allows steepest descent to move the "right" distance along
each direction, eliminating the error along that direction in one iteration. Specifically,
if we can find a set of A-orthogonal directions $\{\delta(k)\}$, meaning that

$$ \delta(k)^T A\, \delta(j) = 0, \qquad j \neq k \tag{9.55} $$

then steepest descent will ensure, after each step, that the error is A-orthogonal to
the chosen direction:

$$ e(k+1)^T A\, \delta(k) = 0. \tag{9.56} $$

Having eliminated the error in some direction, it remains eliminated in future
iterations. That is, let us suppose that

$$ e(k)^T A\, \delta(m) = 0, \qquad m = 1, \ldots, k-1. \tag{9.57} $$

Then the form of the iteration allows us to see how the error evolves:

$$ z(k+1) = z(k) + \alpha(k)\, \delta(k) \;\Longrightarrow\; e(k+1) = e(k) + \alpha(k)\, \delta(k) \tag{9.58} $$

from which the A-orthogonality of $e(k+1)$ follows:

$$ e(k+1)^T A\, \delta(j) = \begin{cases}
0 & j = k \text{ by (9.56)} \\
\underbrace{e(k)^T A\, \delta(j)}_{=\,0 \text{ by (9.57)}} + \alpha(k) \underbrace{\delta(k)^T A\, \delta(j)}_{=\,0 \text{ by (9.55)}} & j < k
\end{cases} $$

Therefore each iteration k permanently eliminates the error along direction $\delta(k)$.
However, if the linear system is well-posed then A must have full rank, in which case
the A-orthogonal directions $\{\delta(1), \ldots, \delta(n)\}$ must span $R^n$. Therefore the error will
equal zero in n iterations! Our enthusiasm for this conclusion should be tempered by
two facts:

1. This number of iterations n is determined by the number of unknowns. For a large
1000 x 1000 2D spatial problem, the number of iterations would be $n = 10^6$,
possibly computationally infeasible.

2. Because of numerical rounding, after n iterations the error is, in fact, not likely
to equal zero.
Algorithm 6 Conjugate Gradient

Goals: Iteratively solve the linear system $Ax = b$ with tolerance $\|Ax - b\| < \zeta$
Function x = CG(A, b, $x_0$, $\zeta$)
    $k \leftarrow 0$
    $d_0 \leftarrow r_0 \leftarrow b - A x_0$
    while $\|r_k\| > \zeta$ do
        $\alpha_k \leftarrow \|r_k\|^2 / (d_k^T A d_k)$                Steepest Descent
        $x_{k+1} \leftarrow x_k + \alpha_k d_k$
        $r_{k+1} \leftarrow r_k - \alpha_k A d_k$                      Update Residual
        $\beta_k \leftarrow \|r_{k+1}\|^2 / \|r_k\|^2$                 Gram-Schmidt for Next Direction
        $d_{k+1} \leftarrow r_{k+1} + \beta_k d_k$
        $k \leftarrow k + 1$
    end while



The key question is how to find the A-orthogonal directions $\{\delta(k)\}$. Given a set of
linearly independent vectors $\{u(k)\}$, the conjugate Gram-Schmidt procedure
(Appendix A.7.3) can compute an A-orthogonal set. Because the residuals $\{r(k)\}$ can
be shown to be orthogonal, and therefore linearly independent, they are a
straightforward choice: $u(k) = r(k)$. If we let $V_k$ represent the space explored after the first k
iterations, then by definition

$$ V_k = \mathrm{span}\{\delta(1), \delta(2), \ldots, \delta(k)\} \tag{9.59} $$

$$ = \mathrm{span}\{r(1), r(2), \ldots, r(k)\}. \tag{9.60} $$

However, because the residuals obey the recursion

$$ r(k) = r(k-1) - \alpha(k-1)\, A\, \delta(k-1), \tag{9.61} $$

it follows that the space $V_k$ equals the space $V_{k-1}$ extended with direction $A\delta(k-1)$.
However, as $\delta(k-1) \in V_{k-1}$, it follows that

$$ V_k = V_{k-1} \oplus A V_{k-1} \tag{9.62} $$

from which it follows that

$$ V_k = \mathrm{span}\{\delta(1), A\delta(1), \ldots, A^{k-1}\delta(1)\} \tag{9.63} $$

$$ = \mathrm{span}\{r(1), A r(1), \ldots, A^{k-1} r(1)\}. \tag{9.64} $$

This sort of space construction is known as a Krylov space: the repeated
application of a linear operator A to some initial vector.

The form of this Krylov space is particularly convenient because the residual $r(k)$ is
already A-orthogonal to all of $\delta(1), \ldots, \delta(k-2)$, leaving only one step of
Gram-Schmidt to remove the relationship with $\delta(k-1)$.

We are thus left with the elegant conjugate-gradient procedure of Algorithm 6, consisting
of three iterated steps:

1. Do one step of steepest descent, eliminating the error in the current direction.

2. Update the residual based on the changed value of the estimate.

3. Based on this residual perform one step of conjugate Gram-Schmidt to find the
next direction.

Fig. 9.6. Continuing from Figure 9.3, looking at a complex two-dimensional system, comparing
Gauss-Jacobi, Gauss-Seidel, and Conjugate Gradient. [Panels: Residuals; Gauss-Seidel Estimate
(GS: 20 Iterations); Conjugate Gradient Estimate (CG: 20 Iterations).]

A variety of criteria may be proposed by which the algorithm is stopped, whether 
one iteration per state element (for the nominally exact answer, excepting numeri- 
cal errors), a fixed predetermined number of iterations, or based on the size of the 
residual. 
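
The three iterated steps and a residual-based stopping rule can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the names `conjugate_gradient`, `apply_A`, `tol`, and `max_iter`, and the small 3 × 3 test system, are assumptions introduced here, and A is assumed to be symmetric positive-definite. The solver touches A only through a function returning the product Az.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product, used only to build the example operator."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_A, b, x0, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A, given only x -> A x."""
    if max_iter is None:
        max_iter = len(b)      # one iteration per state element (nominally exact)
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, apply_A(x))]   # initial residual
    d = list(r)                                      # first direction
    rr = dot(r, r)
    k = 0
    while math.sqrt(rr) > tol and k < max_iter:
        Ad = apply_A(d)
        alpha = rr / dot(d, Ad)                          # steepest-descent step length
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]  # update residual
        rr_new = dot(r, r)
        beta = rr_new / rr                               # conjugate Gram-Schmidt coefficient
        d = [ri + beta * di for ri, di in zip(r, d)]     # next A-orthogonal direction
        rr = rr_new
        k += 1
    return x, k


# Hypothetical small symmetric positive-definite system
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iters = conjugate_gradient(lambda v: matvec(A, v), b, [0.0, 0.0, 0.0])
residual = math.sqrt(sum((bi - ai) ** 2 for bi, ai in zip(b, matvec(A, x))))
```

Both stopping criteria from the text appear in the loop condition: the size of the residual (`tol`) and a fixed iteration budget (`max_iter`), which defaults to one iteration per state element.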

Figure 9.6 shows a two-dimensional estimation problem, comparing the convergence 
of Gauss-Seidel and conjugate gradient. Although the difference in convergence is 
not, in this particular example, terribly striking, conjugate gradient has a major ad- 
vantage: whereas Gauss-Jacobi and Gauss-Seidel require computing certain diago- 
nal or off-diagonal elements of A, the only thing which conjugate gradient requires 
is an ability to compute the matrix-vector product Az. This advantage is significant 
when A is algorithmically generated, as in (9.9), and particularly in the context of 
preconditioning, discussed in Section 9.2.4. 

Conjugate gradient is only one of many Krylov algorithms [168, 321]. Other widely- 
used forms include GMRES [277, 278] and biconjugate gradient [278]. 



9.2.4 Iterative Preconditioning 

Preconditioners [39, 63, 79, 98, 348] were discussed in the context of conditioning
inverse problems in Section 2.3, and many of the methods discussed in Chapter 8
are essentially preconditioners: changes or reductions of basis in order to make a
given problem numerically or computationally simpler.

Fig. 9.7. This figure shows the preconditioned version of Figure 9.6. Clearly the preconditioned
approaches require fewer iterations to converge; however in the case of GS, preconditioning
significantly increases the computational complexity per iteration because the system
matrix A is made less sparse. GS was preconditioned using a DB4 wavelet and CG using the
hierarchical triangles of Section 8.4.1.



Assume, then, that after whatever problem reduction or decoupling of Chapter 8 has
been applied, we are left with a linear system to solve. For linear system problems
of any significant size, unless the problem is very well conditioned (e.g., an
exponential prior model with a short correlation length), it is almost certainly
inappropriate to be considering a plain Gauss-Seidel or conjugate-gradient implementation.

All preconditioners follow the basic approach seen in (8.84) in Chapter 8:

$$A z = b \;\longrightarrow\; S^T A S\,\bar{z} = S^T b \;\longrightarrow\; \underbrace{\bar{A}\bar{z} = \bar{b}}_{\text{Easy?}} \tag{9.65}$$



The two fundamental questions, then, are the possible choices of preconditioner S, 
and the associated computational issues. 

The Gauss-Seidel algorithm considers each state element individually, in turn, so
some sort of preconditioner providing state mixing is needed, such as the DB4 orthogonal
wavelet preconditioner in Figure 9.7. Because the Gauss-Seidel algorithm
makes explicit reference to the elements of A, the preconditioned system matrix
$\bar{A} = S^T A S$ needs to be calculated and stored explicitly. In the 2D problem of
Figure 9.7, although preconditioned GS required fewer iterations to converge, the
computational complexity per iteration was increased by a factor of 58:

    $2 \cdot 10^{4}$ nonzero values in $A$
    $1 \cdot 10^{6}$ nonzero values in $\bar{A} = S^T A S$

Fig. 9.8. Conjugate gradient can be applied to large-scale problems, here the cut-fold problem
of Example 5.1 and Figure 9.6, but sixteen times larger on a 257 × 257 domain. [Panels:
Conjugate Gradient Estimates (100 Iterations); Preconditioned CG Estimates (100 Iterations).]



The preconditioning of conjugate gradient is a bit more subtle [285], and the following
discussion can only be considered an introduction. Conjugate gradient iterates
over directions $\delta(k)$, rather than over coordinates as does Gauss-Seidel, therefore
rotating the entire problem domain has no effect on CG. Because an orthogonal matrix
is essentially a multidimensional rotation, orthogonal preconditioners (such as
the Daubechies wavelets) therefore have no effect on CG.

The beauty of CG is that requiring only the matrix-vector product $\bar{A}\bar{z}$ allows the
preconditioning to be implemented implicitly; never does $\bar{A}$ need to be stored.
That is, under a preconditioner S, the product $\bar{A}\bar{z}$ is easily computed as

$$\bar{A}\bar{z} = S^T A S\,\bar{z} = S^T\big(A(S\bar{z})\big). \tag{9.66}$$

Indeed, not even the preconditioner S and the system matrix A need to be stored
explicitly. Given our usual constraints matrix L, implying

$$A = C^T R^{-1} C + \lambda L^T L, \tag{9.67}$$

then the product $\bar{A}\bar{z}$ should be found implicitly as

$$z = \text{FunctionS}(\bar{z}) \tag{9.68}$$

$$\bar{A}\bar{z} = \text{FunctionST}\big(C^T R^{-1}(C z) + \lambda L^T (L z)\big), \tag{9.69}$$

where FunctionS(x) and FunctionST(x) are algorithms, returning the matrix-vector
products $Sx$ and $S^T x$, respectively. With such an approach, although $\bar{A}$ may have
a higher density, these elements are never explicitly computed or stored, so the
preconditioner adds very little computational overhead.

This latter, implicit approach allows preconditioned CG to be applied to much larger
linear systems than GS. Figure 9.8 shows conjugate gradient applied to a problem of
size 257 × 257, having $257^2 = 66\,049$ unknowns, which was solved using (9.69) in
MATLAB on a regular computer.
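
The implicit product of (9.66) to (9.69) amounts to composing three functions. The Python sketch below shows the idea on a tiny one-dimensional problem. Everything specific in it is an assumption chosen for illustration: a first-difference constraint L, measurements of the first and last elements with R = I, λ = 5, and a hypothetical diagonal scaling standing in for the preconditioner S. No matrix is ever formed.

```python
def apply_L(z):
    """First-difference constraints: returns L z (length n - 1)."""
    return [z[i + 1] - z[i] for i in range(len(z) - 1)]


def apply_LT(y):
    """Transpose of apply_L: returns L^T y (length n)."""
    out = [0.0] * (len(y) + 1)
    for i, yi in enumerate(y):
        out[i] -= yi
        out[i + 1] += yi
    return out


def apply_A(z, lam=5.0):
    """A z = C^T R^{-1} C z + lam L^T L z, with C observing the first and
    last elements and R = I (assumed measurement model)."""
    ctc = [0.0] * len(z)
    ctc[0], ctc[-1] = z[0], z[-1]
    ltl = apply_LT(apply_L(z))
    return [c + lam * l for c, l in zip(ctc, ltl)]


def make_preconditioned(apply_A, apply_S, apply_ST):
    """Return zbar -> S^T (A (S zbar)) without forming any matrix."""
    return lambda zbar: apply_ST(apply_A(apply_S(zbar)))


# Hypothetical diagonal preconditioner S (diagonal, hence S^T = S)
scale = [1.0, 0.5, 0.5, 0.5, 1.0]
apply_S = lambda v: [s * vi for s, vi in zip(scale, v)]
apply_ST = apply_S

apply_Abar = make_preconditioned(apply_A, apply_S, apply_ST)
result = apply_Abar([1.0, -2.0, 0.5, 3.0, 1.0])
```

A function such as `apply_Abar` is all that a conjugate-gradient solver needs, which is why the density of $\bar{A}$ does not translate into storage or per-iteration cost.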



9.2.5 Multigrid 

The previous sections have looked at very common, general-purpose methods of 
solving linear systems, methods that appear in countless textbooks and come as stan- 
dard subroutines in numerical packages such as MATLAB. In this section we want to 
look at multigrid, a method which focuses explicitly on the solving of large, multidi- 
mensional linear estimation problems. 

The multigrid method seeks a hierarchical approach to solving a linear system, repre- 
senting a multidimensional problem at a number of resolutions. We have seen similar 
hierarchical approaches before: 

• Nested dissection (Section 9.1.3) sought to divide a given spatial problem into
decoupled pieces, an approach that was particularly useful for random fields governed
by a low-order Markov prior.

However, although nested dissection is hierarchical, it is not multiresolution. That 
is, the elements being manipulated and rearranged are all individual state elements 
from the finest scale. 

• Hierarchical changes of basis (Section 8.4) converted a linear system on a single 
scale into a new, changed linear system on a hierarchy of scales, which could then 
be solved via methods such as GS or CG. 

The hierarchical basis with an iterative solver has a great deal in common with 
multigrid, however whereas a hierarchical basis leaves us with a single problem, 
distributed across resolutions, the multigrid method proposes an ensemble of cou- 
pled problems, one problem at each resolution, a somewhat simpler mental pic- 
ture. 

The multigrid method [40, 43, 152, 153, 228, 229, 289, 330] is widely used and doc- 
umented; this section emphasizes the intuition behind the method, but does not give 
a detailed and comprehensive derivation of the method, and the reader is directed 
towards the many references. 

For precisely the same reasons that a multiresolution change-of-basis was proposed
in Section 8.4 (the locality of operator A, the slow spread of information from one
state element to another) we now consider the multigrid algorithm, a multiresolution
approach to multidimensional linear systems problems. The effect of locality is
made clear in Figure 9.9; after very few iterations the portions of the error at high
spatial frequencies have been eliminated, but hundreds or thousands of iterations
would be required to significantly reduce long-range errors.

Fig. 9.9. The power spectrum of the estimation error as a function of Gauss-Seidel iterations:
After only five iterations the high-frequency error has disappeared, however hundreds or
thousands of iterations would be required to reduce the large-scale error. [Horizontal axis:
Spatial Frequency.]

The key idea of multigrid is this: 

1. If only a few iterations of a simple linear system method can eliminate high- 
frequency errors, then the error quickly becomes smooth. 

2. However, a smooth (slowly varying) problem can be subsampled, making the re- 
maining long-range errors more local. 

3. These local errors will then once again reduce relatively rapidly when a linear 
system method is applied to the subsampled problem. 



Thus, given an initial estimate $z(0)$, after k Gauss-Seidel iterations the new estimate
$z(k)$ has a corresponding residual

$$r(k) = b - A z(k) \tag{9.70}$$

such that $r(k)$ is spatially smooth. Recall from Section 8.2 that pseudoinverse interpolation
S and subsampling F operators will satisfy $FS = I$, but $SF \neq I$. However,
the smoothness of r means that it can be losslessly subsampled,

$$S F r(k) \approx r(k). \tag{9.71}$$

Algorithm 7 Multigrid

Goals: Iteratively solve the linear system $Ax = b$ with tolerance $\|Ax - b\| < \zeta$
Function $x = \mathrm{Multigrid}(scales, A, b, x_0, \zeta)$

    repeat
        Call $x \leftarrow \mathrm{MG}(scales, A, b, x)$
    until $\|Ax - b\| < \zeta$

Goals: Perform one complete multigrid cycle, recursively from finest to coarsest scale
Function $x = \mathrm{MG}(scale, A, b, x_0)$

    From $x_0$, perform k iterations of Gauss-Seidel, giving $x_k$
    if $scale > 0$ then
        $\bar{A} \leftarrow S^T A S$
        $\bar{b} \leftarrow S^T (b - A x_k)$
        $\bar{x}_0 \leftarrow 0$
        for $i \leftarrow 1$ to $p$ do
            $\bar{x}_i \leftarrow \mathrm{MG}(scale - 1, \bar{A}, \bar{b}, \bar{x}_{i-1})$
        end for
        $x_p \leftarrow x_k + S \bar{x}_p$
        From $x_p$, perform k iterations of Gauss-Seidel, giving $x$
    else
        We are at coarsest scale, solve exactly for $x$.
    end if

The subsampled problem Fr(k) will converge rapidly at first: it is smaller in size 
than the original problem, thus each iteration is of reduced complexity, and Fr(k) 
will be rougher (less smooth) than r(k) and therefore more quickly smoothed by 
Gauss-Seidel. This concept is illustrated in Figure 9.10 for a one-dimensional prob- 
lem, repeatedly subsampled three times. 

Recall from (9.21) that $e(k)$ represents the error in solution $z(k)$ after k iterations,
where

$$r(k) = -A e(k). \tag{9.72}$$

For most priors A is a spatial smoothing operator, so we need to oversmooth r to
ensure that e is smooth, allowing it to be estimated

$$e(k) = S z_c \tag{9.73}$$

by coarse-scale approximation $z_c$. The idea, then, is that z provides the local details,
and $S z_c$ the nonlocal, large-scale parts. $z_c$ must satisfy

$$A \hat{z} = b \tag{9.74}$$

$$A(z + S z_c) = b \tag{9.75}$$

$$S^T A S\, z_c = S^T (b - A z) \tag{9.76}$$

from which we can derive the subsampled linear system

$$\bar{A} z_c = \bar{b}, \qquad \bar{A} = S^T A S, \quad \bar{b} = S^T (b - A z). \tag{9.77}$$

This transformation we have seen before, in (8.7) or (8.84), and is nothing more than
an implicit basis reduction over scale.

Fig. 9.10. The multigrid principle: A simple, local iterative method, such as Gauss-Seidel,
very quickly eliminates high-frequency (local) errors, but only very slowly eliminates
low-frequency (distant) ones. Each panel plots the log power spectrum of the error, as in
Figure 9.9. In each row five additional GS steps have been performed relative to the previous
row; additional columns are introduced as the system is subsampled (F) to coarser scales. The
leftmost column compares GS and multigrid: it is clear that repeated subsampling leads to a
much faster reduction in error, as opposed to finest-scale GS with no subsampling. The bolded
panels show the sequence followed by the multigrid algorithm.

Fig. 9.11. A subsampling operator must be chosen to define the coarser scale. A cell-centred
approach, left, is very simple, however the coarse-scale pixels (x) are in a different location
from the fine-scale ones (o), making the assertion of boundary conditions possibly more
difficult. In many circumstances the vertex-centred approach, right, is preferred.

This new linear system can then, itself, be subsampled after a few iterations of 
Gauss-Seidel, and the whole smoothing-subsampling process repeats recursively 
until the original problem has been subsampled to such a degree that it can be solved 
exactly (e.g., by matrix inversion). The recursive procedure is illustrated in Algo- 
rithm 7. 
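
The recursion can be sketched compactly in Python for a small one-dimensional problem. The sketch follows the structure of Algorithm 7 (smooth, restrict, recurse, interpolate, smooth) but is an illustration under stated assumptions, not the author's code: dense list-of-lists matrices, a vertex-centred linear interpolator S, a tridiagonal test matrix, and the helper names `gauss_seidel`, `interpolator`, `solve_dense`, and `mg` are all choices made here.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def gauss_seidel(A, b, x, iters):
    x, n = list(x), len(b)
    for _ in range(iters):
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x


def interpolator(n_coarse):
    """Vertex-centred linear interpolation S: n_coarse -> 2*n_coarse + 1 points."""
    S = [[0.0] * n_coarse for _ in range(2 * n_coarse + 1)]
    for j in range(n_coarse):
        S[2 * j][j], S[2 * j + 1][j], S[2 * j + 2][j] = 0.5, 1.0, 0.5
    return S


def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (coarsest scale only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def mg(scale, A, b, x, p=1, k=2):
    """One multigrid cycle; p = 1 gives a V-cycle, p = 2 a W-cycle."""
    if scale == 0:
        return solve_dense(A, b)                 # coarsest scale: solve exactly
    x = gauss_seidel(A, b, x, k)                 # pre-smoothing
    S = interpolator((len(b) - 1) // 2)
    St = transpose(S)
    A_c = matmul(St, matmul(A, S))               # coarse system S^T A S
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    b_c = matvec(St, r)                          # coarse right-hand side S^T (b - A x)
    x_c = [0.0] * len(b_c)
    for _ in range(p):
        x_c = mg(scale - 1, A_c, b_c, x_c, p, k)
    x = [xi + ci for xi, ci in zip(x, matvec(S, x_c))]
    return gauss_seidel(A, b, x, k)              # post-smoothing


# Hypothetical test problem: 31 unknowns, tridiagonal (-1, 2, -1) system matrix
n = 31
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n

x_mg = [0.0] * n
for _ in range(10):                              # ten V-cycles over scales 31 -> 15 -> 7 -> 3
    x_mg = mg(3, A, b, x_mg)
x_gs = gauss_seidel(A, b, [0.0] * n, 40)         # the same number of fine-scale sweeps

res_mg = max(abs(bi - ai) for bi, ai in zip(b, matvec(A, x_mg)))
res_gs = max(abs(bi - ai) for bi, ai in zip(b, matvec(A, x_gs)))
```

Comparing `res_mg` with `res_gs` gives a feel for the effect described in Figure 9.10: for the same number of finest-scale Gauss-Seidel sweeps, the multigrid residual is far smaller.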

Although the multigrid algorithm may feel rather complex, the actual implementation
of the method is fairly straightforward, as illustrated by the author's MATLAB
implementation in Algorithm 8, which mirrors the pseudocode in Algorithm 7 very nearly
line by line. The implementation in Algorithm 8 is based
on the explicit evaluation and storage of matrix A at each scale; however, because
multigrid only ever requires the matrix-vector product $Az$ it is possible, and indeed
highly desirable for very large problems, to represent matrix A implicitly via a kernel,
sparse matrix, or function. A fairly elegant approach for computing a functional,
implicit form of A at every scale is shown in Algorithm 9. Because the Gauss-Seidel
iterations used in Algorithm 8 require access to a regular, non-functional A, in
Algorithm 9 the Gauss-Seidel iterations at each scale have been replaced with conjugate
gradient.

There are a number of choices within the multigrid algorithm, the most fundamental
of which is the choice of scale-to-scale subsampling operator. The subsampler may
be cell-centred or vertex-centred [190], as illustrated in Figure 9.11. It may also
be pointwise (frequently leading to an unstable iteration), more commonly a
bilinear interpolation, or based on wavelets [44, 316].

Algorithm 8 MATLAB Multigrid implementation I, mirroring Algorithm 7

    % For convenience, mg is written in terms of subsampler F rather than interpolator S
    % (gs and subsampler are helper functions not shown in this listing)
    function x = mg( scale, a, b, x, p, numgs )
    if (scale > 0)
        % Do GS iterations
        x = gs(a,b,x,numgs);

        % Make simple 2x2 subsampler F
        f = subsampler(scale);

        % Create coarser problem AC * XC = BC
        ac = sparse(f*a*f');
        bc = f*(b-a*x);
        xc = 0*bc;

        % Repeatedly call coarser scale
        for i=1:p,
            xc = mg( scale-1, ac, bc, xc, p, numgs );
        end

        % Add coarse solution back in to current scale and GS iterations to smooth
        x = x + f'*xc;
        x = gs(a,b,x,numgs);
    else
        % We are at coarsest scale, solve outright
        x = a \ b;
    end



A second choice is that of variable p in Algorithm 7, addressing the question of cycle,
that is, how the algorithm switches between scales. Most common are the "V" cycle
(p = 1) and the "W" cycle (p = 2). As p is increased, a higher proportion of time is
spent at the coarser scales; whether this helps will be a function of the problem and
needs to be determined empirically.

Other choices include whether to do GS, as in Algorithm 8, or possibly conjugate 
gradient, as in Algorithm 9, and then whether before, after, or both before and after 
the coarser scale call. There are also practical considerations, whether the system 
matrix A and the subsampler S are dense or implicit, and whether the matrix repre- 
sentation may be a function of scale (e.g., implicit at large, fine scales and dense at 
small, coarse scales). 

Figure 9.12 plots the convergence results, comparing multigrid and preconditioned
conjugate gradient. The multigrid results were computed using the code in Algorithm 9.

Algorithm 9 MATLAB Multigrid implementation II, using implicit, computed system
and scale-transfer matrices

    % afn is a function handle, such that afn(x) evaluates A*x
    % subfn and subfnt are function handles to the subsampler and its transpose
    function x = mg( scale, siz, afn, b, subfn, subfnt, x, p, numcg )
    if (scale > 0)
        % Do CG iterations
        x = pcg(afn,b,[],numcg,[],[],x);

        % Create a handle to implicit coarser-scale problem
        ac = @(x) subfn(afn(subfnt(x)));
        bc = subfn(b-afn(x));
        xc = 0*bc;

        % Repeatedly call coarser scale (siz is passed through unchanged here)
        for i=1:p,
            xc = mg( scale-1, siz, ac, bc, subfn, subfnt, xc, p, numcg );
        end

        % Add coarse solution back in to current scale and CG iterations to smooth
        x = x + subfnt(xc);
        x = pcg(afn,b,[],numcg,[],[],x);
    else
        % We are at coarsest scale, solve well
        x = pcg(afn,b,[],100,[],[],x);
    end

Fig. 9.12. A comparison of the convergence of Conjugate Gradient and Multigrid. The multigrid
estimates were taken after one iteration of multigrid, corresponding to twenty iterations of
GS or CG (compare with Figure 9.6). The convergence results, left, are based on the 257 × 257
problem of Figure 9.8, clearly showing the bilinear vertex-centred multigrid outperforming the
cell-centred case. [Panels: Residuals for Large Problem, plotted against CG Iterations (or
equivalent); MG Estimates (One Iteration).]

Application 9: Surface Reconstruction



The problem of surface reconstruction [27, 160, 161, 304] is a topic of interest in the 
field of computer vision, involving the estimation of an unknown surface based on a 
set of noisy measurements of some function of the surface, possibly of its derivatives, 
and based on a prior model to regularize the problem. 

We have seen repeated examples of surface reconstruction, in the context of mem- 
brane and thin-plate models in Chapter 5, and in particular the two-dimensional re- 
constructions throughout this chapter. 

The surface S of interest is a two-dimensional function $s(x, y)$, twice differentiable,
with gradients

$$p(x,y) = s_x(x,y) = \frac{\partial s(x,y)}{\partial x}, \qquad q(x,y) = s_y(x,y) = \frac{\partial s(x,y)}{\partial y}. \tag{9.78}$$

In terms of choosing a state to represent the surface, we have three alternatives:

1. Estimate the gradients only,

$$z(x,y) = \begin{bmatrix} p(x,y) \\ q(x,y) \end{bmatrix} \tag{9.79}$$

in which case the estimated surface s must be found by integrating p, q. For the
integral to be unique (path-independent), p, q must be gradients of a surface,
implying the consistency constraint [118, 159] that

$$\oint (p\,dx + q\,dy) = 0 \tag{9.80}$$

must hold over all closed paths in the plane. Because the surface is not computed
until after estimation, this formulation does not allow measurements of
the surface, and has limited use. There are important exceptions however, such
as shape-from-shading [118, 161, 167, 201, 251, 349], which have only gradient
measurements.

2. Jointly estimate the surface and its gradients

$$z(x,y) = \begin{bmatrix} s(x,y) \\ p(x,y) \\ q(x,y) \end{bmatrix} \tag{9.81}$$

in which case it is straightforward to have measurements of either gradients or
surface, or to assert prior constraints on the gradients. However, there is the burden
of constraining the relationship between s and p, q:

$$\iint \left[ (s_x - p)^2 + (s_y - q)^2 \right] dx\,dy. \tag{9.82}$$

3. Estimate the surface directly,

$$z(x,y) = \big[\, s(x,y) \,\big], \tag{9.83}$$

in which prior constraints on or measurements of the gradients need to be related
to differences of surface values. Although straightforward, this approach may not
be preferred in cases where we wish to calculate gradient error statistics, or where
only measurements of gradients are available.

In all cases, given linear measurements and prior constraints, the resulting problem
reduces to the familiar linear system

$$\left( L^T L + C^T R^{-1} C \right) \hat{z}(m) = C^T R^{-1} m. \tag{9.84}$$

The linear systems methods in this chapter are well represented in the surface recon- 
struction literature: 

• Conjugate gradient, for surface reconstruction using hierarchical interpolants 
[299], 

• Conjugate gradient, for surface reconstruction using wavelet preconditioning 
[345], 

• Multigrid, for surface reconstruction with cuts and folds [303, 304], 

• Multiscale, related to nested dissection, for surface estimation with error statistics 
[108]. 



For Further Study 



There are a great many numerical methods texts, most of which cover the solving of
linear systems; one such text is the classic by Dahlquist and Björck [78]. The text by
Hackbusch [153] parallels the sequence of methods of this chapter, but in far greater
detail, and is strongly recommended.

For a more advanced look at sparse methods and solvers, the book by George and
Liu [131] is widely cited.

For iterative methods more specifically, the survey paper by Shewchuk [285] is
well worth reading to better understand conjugate gradient. The recent text by Saad
[278] is comprehensive and up-to-date, giving extensive coverage to GMRES and
Krylov methods. Saad's book does not, however, look at multigrid; for accessible
treatments of the multigrid method the texts of Briggs [43], Hackbusch [152],
McCormick [228], and Wesseling [330] are recommended.

Although the conjugate-gradient method and the Cholesky decomposition are standard
components of most numerical packages, such as MATLAB, Numerical Recipes
[264] remains a valuable resource for readers seeking to do their own implementations,
or to develop variations on standard methods.



Sample Problems 

Problem 9.1: Numerics of Linear Systems 

The solution to a linear system $Az = b$ can be written algebraically as

$$z = A^{-1} b \tag{9.85}$$

under the assumption that A is square and invertible.

Numerically, however, it was claimed in Section 9.1 that explicitly inverting A
is inferior to Gauss / Cholesky methods. In MATLAB, the two alternative implementations
of (9.85) would be written as

    Z = inv(A)*b;     Z = A \ b;     (9.86)

Construct a linear system, for example the one-dimensional interpolation of
Problem 9.2, and compare the computational time and accuracy of the two computed
estimates in (9.86).

Problem 9.2: Iterative Solutions and Interpolation 

We wish to reproduce and extend some of the results of Example 9.2. We have a
process of length n = 100, of which we measure first and last elements:

$$m = Cz + v$$

We propose two possible linear systems to solve:

1. First Order: $A_1 = C^T C + 5 \cdot L_1^T L_1 \;\longrightarrow\; A_1 z = C^T m$

2. Second Order: $A_2 = C^T C + 2000 \cdot L_2^T L_2 \;\longrightarrow\; A_2 z = C^T m$.

If the iteration is stable, as determined by its eigenvalues, then its time constant
is determined by the eigenvalue which is largest in magnitude:

$$\tau = \frac{-1}{\ln\left( \max_i |\lambda_i| \right)}$$

(a) Are GJ and GS stable for both the first- and second-order interpolation problems?

(b) For n = 3, 4, ..., 100 plot the time constant $\tau$ for both GJ and GS for the
first-order problem. What do you observe?

(c) For n = 100, plot all of the eigenvalues in the complex plane. Generate two
plots, one for each of GJ and GS, for the first-order problem. What do you
observe?

(d) An eigenvalue describes how rapidly an error decays, where the shape of the
error is described by the associated eigenvector. Plot two eigenvectors for the
first-order GS problem, one corresponding to an eigenvalue near 1.0, another
corresponding to an eigenvalue near 0.0. What do you observe?

(e) For n = 100, apply GS to the first- and second-order problems, initializing
with $z_0 = 0$. What do you observe?

Problem 9.3: Two-Dimensional Interpolation 



We wish to solve a two-dimensional estimation problem. Let Z be a 100 × 100
random field, characterized by constraints $L_2$ with free boundary conditions.
Use a regularization constant $\lambda = 5000$.
We have four measurements, assumed to have unit-variance noise:

$$m = \begin{bmatrix} 10 \\ 10 \\ 10 \\ 10 \end{bmatrix} = \begin{bmatrix} z_{25,25} + v_1 \\ z_{25,75} + v_2 \\ z_{75,25} + v_3 \\ z_{75,75} + v_4 \end{bmatrix}$$

(a) The system matrix A is a large, sparse 10000 × 10000 array. Do not create
matrices $L_2$ or A, not even sparse forms.

Rather, develop a method, an algorithm (e.g., in MATLAB) for computing
the two parts of the matrix-vector product

$$Az = \left( C^T C + \lambda L_2^T L_2 \right) z = C^T C z + \lambda L_2^T L_2 z$$

The term $C^T C z$ is easily implemented; you will probably want to use a
convolution to calculate $L_2^T L_2 z$.

(b) Given the matrix-vector product in part (a), you can now implement an iterative
solver. Implement CG and plot the results after 10, 100, and 1000
iterations.

(c) Run CG for many iterations, until you feel the method has converged, to acquire
$z_{\text{opt}}$. We can calculate the mean-square error at any iteration by comparing
the iterative estimate $z_i$ to the optimum:

$$\mathrm{MSE}_i = \| z_i - z_{\text{opt}} \|.$$

Plot $\mathrm{MSE}_i$, $1 \le i \le 1000$, for CG. Comment.

(d) Modify your algorithm from (a) to yield the needed nonzero elements of A
in order to implement GS. Plot $\mathrm{MSE}_i$, comparing GS and CG.

Problem 9.4: Multigrid 

Problem 9.3 set up a two-dimensional estimation problem. Solve the estimation
problem for Z using multigrid. It may be more convenient to choose a 128 × 128
or 129 × 129 domain to allow repeated subsampling.

Problem 9.5: Open-Ended Real-Data Problem — Multigrid 

The strength of multigrid lies in its ability to solve poorly conditioned, nonlo- 
cal problems. Let's generate a set of estimates, such as those in Figure 5.17 on 
page 163, now using multigrid. 

In particular, create a truth random field 

Z xv = X ttJt - 50 < x, y < 50 

x,y 10Q _ ,*_ 

from which we create noisy measurements

$$M = Z + V, \qquad V \sim \sigma^2 I$$

with a noise variance of $\sigma^2 = 1$.

The analytical forms which preceded Figure 5.17 were expressions for P, whereas
multigrid requires system matrix A, which is expressed in terms of $P^{-1}$. To
avoid inverting a large matrix, we can choose $P^{-1}$ directly by selecting a
sparse/Markov prior, such as membrane or thin-plate, with an appropriately chosen
correlation length, for example based on Figure 5.15.

Create a "satellite track" measurement pattern, such as that shown in Figure 5.17. 
Plot the estimates and discuss the number of multigrid iterations required for 
convergence, as a function of the choice of prior and the correlation length. 



10 

Kalman Filtering and Domain Decomposition 



In this chapter we examine the solution of large dynamic estimation problems, prob- 
lems which may stem from one of two sources: 

1. Problems which are inherently spatio-temporal, such as the processing of video
(spatial two-dimensional images over time) or remote sensing (the earth's surface
or an atmospheric/oceanic volume sampled over time);

2. Multidimensional static problems, which have been converted to dynamic form, 
as discussed below. 

In particular, rather than solving an inverse problem as one huge linear system of 
equations, as was undertaken in Chapter 9, here we want to benefit from the dynamic 
aspects of the problem to break it into smaller pieces. 

Recall from Section 4.1.2 that we were able to show the algebraic equivalence or
duality between a given dynamic problem

$$z(t+1) = A(t) z(t) + B(t) w(t), \qquad w(t) \sim \mathcal{N}(0, I) \tag{10.1}$$

$$m(t) = C(t) z(t) + v(t), \qquad v(t) \sim \mathcal{N}(0, R(t)) \tag{10.2}$$

and an associated static problem (from (4.24))

$$m = Cz + v, \qquad z \sim (\mu, P), \quad v \sim (0, R). \tag{10.3}$$

In that case the pieces of the dynamic problem, spread over time, were concatenated 
to construct a single static problem. Given that it was possible to construct such an 
equivalence, the obvious question is whether the process can be turned around: can 
we convert a static problem into a dynamic one? 

There are two reasons why such a conversion to a dynamic problem offers an attractive
alternative to the linear systems methods of Chapter 9:

1. The Kalman filter produces both estimates and estimation error statistics. In some
cases these error statistics are badly needed, for example to interpret the statistical
significance of the estimation results, or possibly to further incorporate measurements
via data fusion (as in Section 3.4). Although it is possible to acquire estimation
error statistics from linear systems methods, in most cases computing these
statistics is computationally prohibitive (except for the fully stationary/periodic
case of the FFT).

2. If it is possible to break a large problem into relatively small pieces, there may be
great computational advantages over solving the linear system as a whole. In particular,
given an N × N random field, if it is possible to express the N columns of
the random field as a dynamic process, each of length N, then (ignoring sparsity)
we have the complexities

$$\text{Static Complexity } O(N^6) \qquad \text{Dynamic Complexity } O(N \cdot N^3). \tag{10.4}$$

Thus the dynamic approach possibly offers a reduction in complexity of $O(N^2)$
relative to direct static solvers.
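
As a rough worked example of (10.4), take an illustrative field size of N = 1000 (a value assumed here only to make the orders of magnitude concrete) and ignore sparsity and constant factors. The static problem has $N^2$ unknowns and a dense solve is cubic in the number of unknowns, whereas the dynamic form performs N steps, each cubic in the column length N:

```latex
\begin{align*}
\text{Static:}  \quad & O\big((N^2)^3\big) = O(N^6) \;\rightarrow\; (10^3)^6 = 10^{18} \\
\text{Dynamic:} \quad & O(N \cdot N^3) = O(N^4) \;\rightarrow\; 10^3 \cdot (10^3)^3 = 10^{12} \\
\text{Ratio:}   \quad & N^6 / N^4 = N^2 \;\rightarrow\; 10^{6}
\end{align*}
```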

The crux of the matter, then, is the following key question: 

Do the portions of a given static problem obey a dynamic relationship? 

Suppose we have a zero-mean random field $z \sim P$, which we break into two pieces
with known statistics

$$z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \sim \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix}. \tag{10.5}$$

We can assert a linear dynamic relationship between $z_1$ and $z_2$:

$$z_2 = A z_1 + B w, \qquad E[z_1 w^T] = 0, \quad w \sim I. \tag{10.6}$$

However, the relationship in (10.6) implies the cross-statistics between $z_1$ and $z_2$,
therefore we should be able to learn A, B in (10.6) from the given, static cross-statistics
in (10.5):

$$P_{12} = E[z_1 z_2^T] = E\big[z_1 (A z_1 + B w)^T\big] \tag{10.7}$$

$$\phantom{P_{12}} = P_{11} A^T + 0 \tag{10.8}$$

therefore

$$A = P_{12}^T P_{11}^{-1} = P_{21} P_{11}^{-1}. \tag{10.9}$$

Similarly, from the covariance of $z_2$:

$$P_{22} = E[z_2 z_2^T] \tag{10.10}$$

$$\phantom{P_{22}} = E\big[(A z_1 + B w)(A z_1 + B w)^T\big] \tag{10.11}$$

$$\phantom{P_{22}} = A P_{11} A^T + B B^T \tag{10.12}$$

therefore

$$B B^T = P_{22} - A P_{11} A^T. \tag{10.13}$$



We see, therefore, that the static-dynamic conversion is possible. Barring computa- 
tional limitations, the Kalman filter can then immediately be applied to the problem. 
This chapter surveys some of the more common ways in which dynamic methods are 
applied to large-scale estimation problems, with Section 10.2 in particular looking at 
sparse / reduced-order forms of the Kalman filter. 
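
The conversion in (10.9) and (10.13) can be tried numerically on a very small case. In the Python sketch below, the 4 × 4 covariance with entries $\rho^{|i-j|}$, the value ρ = 0.5, the split into two blocks of length two, and the helper names are all assumptions chosen for illustration; they are not taken from the text.

```python
def matmul(A, B):
    Bt = [list(col) for col in zip(*B)]
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical static prior: exponentially decaying correlation on four elements
rho = 0.5
P = [[rho ** abs(i - j) for j in range(4)] for i in range(4)]

# Partition z = [z1; z2] into two pieces of length two, as in (10.5)
P11 = [row[:2] for row in P[:2]]
P21 = [row[:2] for row in P[2:]]
P22 = [row[2:] for row in P[2:]]

# (10.9): A = P21 * inv(P11)
A = matmul(P21, inv2(P11))

# (10.13): B B^T = P22 - A P11 A^T
APAt = matmul(matmul(A, P11), transpose(A))
BBt = [[P22[i][j] - APAt[i][j] for j in range(2)] for i in range(2)]
```

For this particular prior, A comes out as approximately [[0, 0.5], [0, 0.25]]: the second piece depends only on the last element of the first piece, which is what one would expect of a first-order Markov process.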



10.1 Marching Methods 

Suppose we wish to solve a d-dimensional spatial problem, for which a spatial prior 
is given, using a Kalman filter to proceed dynamically along a sequence of (d — 1)- 
dimensional slices; that is, row-by-row or column-by-column for a 2D problem, and 
plane-by-plane in 3D. 

We are given a two-dimensional random field 

    Z = [z_1  z_2  ...  z_N].    (10.14)

To model this sequence of columns dynamically, we seek a linear model of the form

    z_{i+1} = f(z_i, z_{i-1}, ..., z_1) + w_i.    (10.15)

Ideally z_{i+1} would, for example, be a function of z_i only. From Chapter 6 we
recognize this as a statement of Markovianity. Indeed, if Z is first- or second-order
Markov,^1 then

    E_{LLSE}[z_{i+1} | z_i, z_{i-1}, ..., z_1] = E_{LLSE}[z_{i+1} | z_i].    (10.16)

Similarly, if Z is third- through fifth-order Markov, then

    E_{LLSE}[z_{i+1} | z_i, z_{i-1}, ..., z_1] = E_{LLSE}[z_{i+1} | z_i, z_{i-1}].    (10.17)

Since we are seeking linear dynamic models, leading to linear estimators, we assert
a Gauss-Markov autoregressive form,

    z_{i+1} = A_i^{(1)} z_i + B_i w_i,    (10.18)



1 See Figure 6.5 for the definition of Markov random field order.

[Figure 10.1 panels: Covariance of First-Order Marching; Covariance of Second-Order Marching]

Fig. 10.1. The above panels show how the original static prior covariance is represented by the 
marching method. For each increment in marching order, one additional band on each side of 
the diagonal is modelled exactly. The success of the marching method depends on how well 
more distant bands can be approximated by the modelled statistics of the near-diagonal bands. 



a first-order marching model corresponding to (10.16), or 



    z_{i+1} = A_i^{(1)} z_i + A_i^{(2)} z_{i-1} + B_i w_i,    (10.19)



a second-order^2 marching model corresponding to (10.17). Clearly these generalize
to the jth order case

    z_{i+1} = A_i^{(1)} z_i + ... + A_i^{(j)} z_{i-j+1} + B_i w_i.    (10.20)



We have already seen (Section 4.1.1) that such higher-order models can be cast into 
first-order Gauss-Markov form by state augmentation. That is, under the following 
change of variables 



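    z(t) = \begin{bmatrix} z_i \\ z_{i-1} \\ \vdots \\ z_{i-j+1} \end{bmatrix},
    A(t) = \begin{bmatrix} A_i^{(1)} & \cdots & A_i^{(j-1)} & A_i^{(j)} \\ I & & & 0 \\ & \ddots & & \vdots \\ & & I & 0 \end{bmatrix},
    B(t) = \begin{bmatrix} B_i \\ 0 \\ \vdots \\ 0 \end{bmatrix}    (10.21)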














then we obtain the standard first-order dynamic recursion 

    z(t+1) = A(t) z(t) + B(t) w(t).    (10.22)
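
The state augmentation can be sketched in a few lines of NumPy. This is only an illustration under the assumption that all A^{(k)} blocks are square and of equal size; the helper name `augment` is hypothetical.

```python
import numpy as np

def augment(A_list, B):
    """Build the first-order (companion-form) A(t), B(t) from the blocks
    [A1, ..., Aj] (each n x n) and B (n x m) of a jth order model."""
    j = len(A_list)
    n = A_list[0].shape[0]
    A_aug = np.zeros((j * n, j * n))
    A_aug[:n, :] = np.hstack(A_list)           # top block row: A1 ... Aj
    A_aug[n:, :-n] = np.eye((j - 1) * n)       # shift older columns down
    B_aug = np.zeros((j * n, B.shape[1]))
    B_aug[:n, :] = B                           # noise drives newest column only
    return A_aug, B_aug

# Second-order example with 2-element columns and arbitrary test matrices.
rng = np.random.default_rng(0)
A1, A2, B = (rng.standard_normal((2, 2)) for _ in range(3))
A_aug, B_aug = augment([A1, A2], B)
z_i, z_im1, w = (rng.standard_normal(2) for _ in range(3))
z_next = A_aug @ np.concatenate([z_i, z_im1]) + B_aug @ w
```

The top half of `z_next` is the new column and the bottom half is simply a copy of z_i, which is what allows an unmodified Kalman filter to be used.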



2 One-dimensional and multidimensional Markov orders have differing definitions, so it is
indeed correct that a second-order 1D marching process {z_i} corresponds to a third- or
fourth-order 2D random field Z. This fluidity or convertibility of order we have seen before,
in the autoregressive state augmentation of Section 4.1.1.

[Figure 10.2: three adjacent image columns i, i+1, i+2, with pixels z_{1,i} ... z_{N,i}, z_{1,i+1} ... z_{N,i+1}, and z_{1,i+2} ... z_{N,i+2}]

Fig. 10.2. Is the marching model modelling too much? Consider the first-order model of
(10.18), as sketched here, which represents column z_{i+1} in terms of the previous column
z_i. This means, however, that we represent all of the joint statistics of the columns, includ-
ing those of pixels z_{1,i} and z_{N,i+1}, spatially separated by N pixels, whereas the relationship
between z_{1,i-1} and z_{1,i+1}, spatially separated by only two pixels, is not explicitly modelled.



That is, the Kalman filter can, with no modifications, solve first- and higher-order 
multidimensional marching problems. 

For the N x N two-dimensional case, the computational complexity of computing
estimates has been reduced from O(N^6) for direct solvers to O(N (Nj)^3) for
the jth order marching approach, from which we can clearly see the motivation for
making j as small as possible. In the N x ... x N case in d dimensions, the direct
approach has complexity O(N^{3d}), whereas the marching approach has complexity
O(N (N^{d-1} j)^3).
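
To give a feel for these orders of magnitude, the two operation counts can be compared directly. The function names below are illustrative, and constant factors are ignored, so only the ratio is meaningful.

```python
def direct_cost(N, d):
    """Order-of-magnitude cost of a direct solver on N^d pixels: (N^d)^3."""
    return (N ** d) ** 3

def marching_cost(N, d, j):
    """Order-of-magnitude cost of jth order marching:
    N steps, each cubic in the state size N^(d-1) * j."""
    return N * (N ** (d - 1) * j) ** 3

# Hypothetical 256 x 256 image.
speedup_first_order = direct_cost(256, 2) // marching_cost(256, 2, 1)
speedup_second_order = direct_cost(256, 2) // marching_cost(256, 2, 2)
```

In this sketch, going from first- to second-order marching costs a factor of 2^3 = 8, which is the motivation for keeping j small.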



There remain two questions: 

1. What, exactly, do we lose in going to the dynamic/marching approach? The re- 
duction in computational complexity must bring with it some limitations or ap- 
proximations in the problem solution. 

2. Is the marching method computationally feasible, and is it possible to reduce the 
state size further for additional computational benefits? 



In terms of the former question, indeed, some approximations are made. The jth
order dynamic model precisely represents the joint statistics of j + 1 consecutive
columns; however, the joint statistics of columns more than j steps apart are modelled
only implicitly. Figure 10.1 illustrates this behaviour: each increment in j extends the
exact representation of the prior covariance by one additional band. Whether the
implicit statistics are correct depends on the Markov order of the random field.
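
The banded behaviour can be checked numerically in the simplest possible setting. The sketch below assumes scalar "columns" (a one-dimensional sequence), fits a first-order model to each consecutive pair as in (10.9) and (10.13), and rebuilds the covariance that the model implies; the function name and both test covariances are assumptions made for illustration.

```python
import numpy as np

def first_order_implied_cov(P):
    """Covariance implied by a first-order marching model fitted to the
    diagonal and first off-diagonal of P (scalar state at each step)."""
    N = P.shape[0]
    Pm = np.zeros_like(P, dtype=float)
    Pm[0, 0] = P[0, 0]
    for i in range(N - 1):
        a = P[i + 1, i] / P[i, i]                  # scalar form of (10.9)
        b2 = P[i + 1, i + 1] - a * P[i, i] * a     # scalar form of (10.13)
        Pm[i + 1, : i + 1] = a * Pm[i, : i + 1]    # cross-covariances with the past
        Pm[: i + 1, i + 1] = Pm[i + 1, : i + 1]
        Pm[i + 1, i + 1] = a * Pm[i, i] * a + b2
    return Pm

idx = np.arange(6)
d = np.abs(idx[:, None] - idx[None, :])
P_markov = 0.8 ** d                # exponential covariance: first-order Markov
P_gauss = np.exp(-d ** 2 / 8.0)    # Gaussian-shaped covariance: not Markov
Pm_markov = first_order_implied_cov(P_markov)
Pm_gauss = first_order_implied_cov(P_gauss)
```

In both cases the diagonal and first off-diagonal are reproduced exactly; only for the Markov prior do the more distant bands also agree.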

In terms of the latter question, consider Figure 10.2. The first-order marching model
represents all of the joint statistics of the two columns, including those of distantly-
separated state elements z_{1,i} and z_{N,i+1}, whereas the statistics of pixels only two
columns apart, such as between z_{1,i-1} and z_{1,i+1}, are only implicitly modelled. It
would seem that representing the joint statistics of each column is excessive, and that
only a portion of the column should somehow be needed: this is the goal of the strip-
based, reduced-update, and reduced-order Kalman filters discussed in Section 10.2.



10.2 Efficient, Large-State Kalman Filters 

At this point we assume that we have been given a large, dynamic estimation prob- 
lem, whether inherently dynamic (a spatio-temporal problem) or not (a spatial prob- 
lem, converted to a dynamic one via marching). 

In any event, we have been given a dynamic process model and associated dynamic 
measurements and measurement model: 

    z(t+1) = A(t) z(t) + B(t) w(t),    w(t) \sim N(0, I)    (10.23)

    m(t) = C(t) z(t) + v(t),    v(t) \sim N(0, R(t)),    (10.24)

where, by definition, the model of (10.23) is first-order Gauss-Markov (although
possibly representing a higher-order spatial Markov process by state augmentation
(10.21)).

There are two basic reasons why it is much harder to find efficient dynamic algo- 
rithms compared to static ones: 

1. In very many cases (e.g., membrane, thin-plate, and Markov priors) the static
problem is characterized by very sparse matrices L, Q; in the Kalman filter,
however, sparsity tends to disappear:

• In the prediction step

    P(t+1|t) = A(t) P(t|t) A^T(t) + B(t) B^T(t),    (10.25)

unless A is diagonal, P(t+1|t) is generally less sparse than P(t|t).

• In the update step,

    K(t) = P(t|t-1) C^T (C P(t|t-1) C^T + R)^{-1},    (10.26)

the gain K will generally be a dense matrix, since the inverse of a sparse
(C P C^T + R) is normally dense.

2. In many cases the prior model for the static case is simple, stationary, sparse, or at
least regularly structured; we have taken advantage of such attributes to develop
a variety of efficient approaches. For example, even the complex, nonstationary
prior of Example 5.1 on page 159 admits an implicit, algorithmic implementation
of the prior.

In the Kalman filter, on the other hand, even if the dynamic process starts with a
simple, stationary, well-structured covariance P, over time the covariance P(t|t)
becomes complex, nonstationary, and unstructured, as was seen in Application 4,
because the behaviour of P(t|t) depends on the interaction of the problem dy-
namics with the location C and quality R of the measurements.
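
The loss of sparsity is easy to observe numerically. The following sketch uses an assumed tridiagonal A and a diagonal initial covariance (both chosen only for illustration), counts the nonzero entries of the predicted covariance over three prediction steps (10.25), and checks that the inverse of a tridiagonal matrix is full.

```python
import numpy as np

n = 8
# Hypothetical tridiagonal dynamics: each element interacts with its two neighbours.
A = 0.5 * np.eye(n) + 0.2 * np.eye(n, k=1) + 0.2 * np.eye(n, k=-1)
BBt = 0.1 * np.eye(n)
P = np.eye(n)                       # diagonal (maximally sparse) starting covariance

nonzeros = [int(np.count_nonzero(np.abs(P) > 1e-12))]
for _ in range(3):
    P = A @ P @ A.T + BBt           # prediction step (10.25)
    nonzeros.append(int(np.count_nonzero(np.abs(P) > 1e-12)))

# The inverse of a sparse (here tridiagonal) matrix is normally dense.
inverse_nonzeros = int(np.count_nonzero(np.abs(np.linalg.inv(A)) > 1e-12))
```

Each prediction step widens the band of P by two diagonals on either side, so after only a few steps the covariance of this small example is nearly full.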

If the number of elements in z is modest, then a direct implementation of the KF is
possible, treating all of the system matrices A, B and estimation error covariances
P(t|t-1), P(t|t) as dense.

There are two limitations to the straightforward application of the dense Kalman 
filter to large-scale problems: 

1. Computational complexity, specifically due to matrix-matrix multiplication and
matrix inversion;

2. Storage complexity, due to the storage of dense covariances. 

A great many approaches have been developed for implementing Kalman filter-like 
solutions to large-scale dynamic estimation problems, in many cases customized to 
the specific attributes of a multidimensional estimation problem. Some of the most 
common approaches are summarized in the following sections: 

• Section 10.2.2 — Steady-State KF: do not update or predict covariances 

• Section 10.2.3 — Strip KF: break the problem into pieces 

• Section 10.2.4 — Reduced-Update KF: limit the extent of the model 

• Section 10.2.5 — Sparse KF: implicit representation of covariances 

• Section 10.2.6 — Reduced-Order KF: use basis reduction to reduce state size 



10.2.1 Large-State Kalman Smoother 

In many cases z(t) is not a causal process, particularly when Z is marching spatially
in solving a multidimensional static problem. In such cases we may wish the esti-
mates \hat{z}(t) to be computed acausally. In principle we have already discussed acausal
filtering of dynamic processes using the RTS smoother in Section 4.3.3; however, to
compute smoothed estimates over t \in [0, \tau] the RTS smoother must store P(t|t) for
every time t in the interval. In those cases where z is a large problem, the storage

Example 10.1: Interpolation by Marching

Suppose we have a stationary, periodic random field

[Figure: the random field]

We have sparse measurements of the random field, therefore the FFT method 
cannot be used for estimation, so we use the Kalman filter and march, column by 
column. The column dynamics can be learned from the prior spatial statistics 



[Figure panels: Matrix A; Matrix BB^T]



where the sparse structure of A is consistent with the random field being Markov.
With the model in place, we can use the Kalman filter to generate causal (left-to-
right) or anticausal (right-to-left) estimates:



[Figure panels: estimates and error variances diag(P) for the Sparsely Measured (left) and Densely Measured (right) cases]



We can clearly see how the estimates and error variances respond unidirec-
tionally to the measurements, especially in the sparsely-measured cases (left).

It is possible to consider merging the two separate (causal and anticausal) Kalman 
filter results, based on (10.28): 



[Figure panels: merged estimates, Sparsely Measured (left) and Densely Measured (right)]




There are a number of artifacts present, which stem from merging the results on 
the basis of error variances only, rather than the full covariance (which would 
be too large to store in multidimensional problems of interest). Nevertheless the 
merged estimates are considerably more appealing, and probably more useful, 
than the unidirectional ones on the previous page. 



of \tau covariances P(1|1), ..., P(\tau|\tau) is almost certainly prohibitive, leaving us with
three alternatives:



1. Don't smooth, solve the estimates causally only. For many true temporal esti- 
mation problems this is likely to be appropriate, but less so for spatial-marching 
ones. 

2. Perform suboptimal smoothing, whereby the covariances P(t\t) are stored in 
banded / kernel / sparse inverse or some other compact form. Running an in- 
formation Kalman filter (Section 4.3.1) in a sparse-matrix form, as discussed in 
Section 10.2.5, may indeed be practical. 

3. Perform suboptimal smoothing by running the Kalman filter twice, once causally
and once anticausally, and merging the two sets of results, as illustrated in
Example 10.1. Thus we compute

    \hat{z}(t|0,...,t), P(t|0,...,t)    and    \hat{z}(t|t,...,\tau), P(t|t,...,\tau)    (10.27)

which are merged as

    \hat{z}_i(t|0,...,\tau) = \frac{ \hat{z}_i(t|0,...,t) / P_{ii}(t|0,...,t) + \hat{z}_i(t|t,...,\tau) / P_{ii}(t|t,...,\tau) }
                                   { -1/P_{ii}(t) + 1/P_{ii}(t|0,...,t) + 1/P_{ii}(t|t,...,\tau) }    (10.28)

for example, where the -1/P_{ii}(t) term in the denominator is to subtract out the
prior, which would otherwise be double-counted in the statistics. Computing the
merged results still requires saving \hat{z} and diag(P) at each point in time, for both
directions, requiring storage 4\tau times greater than that of z alone, although only
four times greater than the storage of Z, the underlying static random field.

Practical solutions to the Kalman smoother therefore rely on efficient methods for 
Kalman filtering, discussed in the following sections. 
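
The element-wise merge of (10.28) is simple enough to sketch directly. The function name and the scalar test values below are assumptions for illustration; the returned merged variance 1/info is implied by the information-form argument behind (10.28) rather than stated in the text.

```python
def merge_estimates(z_fwd, p_fwd, z_bwd, p_bwd, p_prior):
    """Element-wise merge of causal and anticausal estimates, as in (10.28).
    All arguments are equal-length sequences of scalars (estimates and variances)."""
    merged_z, merged_p = [], []
    for zf, pf, zb, pb, p0 in zip(z_fwd, p_fwd, z_bwd, p_bwd, p_prior):
        # Sum the two informations, subtracting the doubly counted prior.
        info = 1.0 / pf + 1.0 / pb - 1.0 / p0
        merged_z.append((zf / pf + zb / pb) / info)
        merged_p.append(1.0 / info)
    return merged_z, merged_p

# Scalar check: zero-mean prior with variance 4; the causal pass has seen a
# measurement 2.0 (noise variance 1), the anticausal pass a measurement 4.0
# (noise variance 2).
p0 = 4.0
pf = 1.0 / (1.0 / p0 + 1.0 / 1.0); zf = pf * 2.0 / 1.0
pb = 1.0 / (1.0 / p0 + 1.0 / 2.0); zb = pb * 4.0 / 2.0
z_merged, p_merged = merge_estimates([zf], [pf], [zb], [pb], [p0])
```

For a single scalar the merge reproduces the optimal estimate from both measurements; for vectors it is only approximate, because the off-diagonal covariances are ignored.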



10.2.2 Steady-State KF 

If the dynamic system is temporally stationary, then the steady-state Kalman filter
(Section 4.3.2) may be applied. Although large, the matrices P(t|t), P(t|t-1), K(t) are
not a function of t and need to be calculated only once, leaving the relatively simple
vector equations to recursively calculate \hat{z}(t|t), \hat{z}(t|t-1). Since the vector equa-
tions explicitly depend only on K, and not on P(t|t), P(t|t-1), it is actually only
K which needs to be stored. Finally, because K does not need to satisfy positive-
definiteness it may readily be approximated, for example using a sparse/banded ap-
proach (Section 5.3) to address storage complexity.

It should be pointed out that the fully stationary case, stationary in all dimensions
with periodic boundaries, should be solved using FFT methods (Section 8.3). However,
the steady-state Kalman filter requires only temporal stationarity, and not spatial
stationarity or periodicity.
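
One simple way to obtain the steady-state gain, sketched below, is to iterate the covariance recursion until it stops changing and then keep only K. This is an illustrative approach under the assumption of time-invariant A, B, C, R, not necessarily the method of Section 4.3.2; the function names are hypothetical.

```python
import numpy as np

def steady_state_gain(A, BBt, C, R, iters=500, tol=1e-12):
    """Iterate predict/update covariance equations to a fixed point and
    return the steady-state gain K and predicted covariance P(t|t-1)."""
    P_pred = np.eye(A.shape[0])
    for _ in range(iters):
        S = C @ P_pred @ C.T + R
        K = np.linalg.solve(S, C @ P_pred).T       # K = P C^T S^{-1}
        P_upd = P_pred - K @ C @ P_pred
        P_next = A @ P_upd @ A.T + BBt
        converged = np.max(np.abs(P_next - P_pred)) < tol
        P_pred = P_next
        if converged:
            break
    S = C @ P_pred @ C.T + R
    K = np.linalg.solve(S, C @ P_pred).T
    return K, P_pred

def steady_state_filter(A, C, K, measurements, z0):
    """Run only the vector equations, using the fixed gain K."""
    z_upd, history = z0, []
    for m in measurements:
        z_pred = A @ z_upd
        z_upd = z_pred + K @ (m - C @ z_pred)
        history.append(z_upd)
    return history

# Scalar check with a known answer: the fixed point satisfies P^2 = 0.19.
A = np.array([[0.9]]); BBt = np.array([[0.19]])
C = np.array([[1.0]]); R = np.array([[1.0]])
K, P_pred = steady_state_gain(A, BBt, C, R)
```

Only K needs to be stored once the iteration has converged; the covariances can then be discarded.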



10.2.3 Strip KF 

As was made clear in Figure 10.2, the regular Kalman filter will consider all of the
interrelationships of elements in z. If z contains a set of pixels, distributed spatially,
such as the column or row of an image, then almost certainly the interrelationships
of distantly-separated state elements are not of great value. Therefore it is logical to
consider breaking the state into a number of pieces, which is the essence of the Strip
Kalman Filter [337].

The state vector z(t)^T = [z_1(t) z_2(t) ... z_n(t)] is divided into a number of pieces

[Figure 10.3 panels: Original domain; Separated Strips, each with own KF; Estimates Transformed back to Original Domain]

Fig. 10.3. The Strip Kalman Filter: The Kalman filter state vector, here the row of an image, is 
partitioned into strips, such that a separate Kalman filter is run on each strip, with the estimated 
strips then recombined. 



    z_1(t)^T = [ z_1(t)  ...  z_w(t) ],    z_2(t)^T = [ z_{w-\Delta+1}(t)  ...  z_{2w-\Delta}(t) ],    ...    (10.29)

for some appropriate strip width w and strip overlap \Delta, where

    w \approx n/q + \Delta.    (10.30)



The resulting strips are processed separately, and the resulting estimates from the in- 
dividual strips concatenated to form an estimate of the original problem, as illustrated 
in Figure 10.3. 

This process of dividing into pieces, independent processing, and recombining is 
exactly the same as what was described under local processing in Section 8.2.3. In 
particular, compare Figure 10.3 with the overlapped transformation of (8.49) and 
(8.51) on page 259. 

The computational complexity of the Kalman filter goes as the cube of the state
size. Thus, whereas the original filter had a complexity of O(n^3) per row, the strip
filter has a complexity of

    O( q (n/q + \Delta)^3 )    (10.31)

per row, where q represents the number of strips, and \Delta the overlap between the
strips. Differentiating (10.31), we can find the optimum number of strips, minimizing
computational complexity, as

    q_opt = 2n / \Delta    (10.32)



336 10 Kalman Filtering and Domain Decomposition 




[Figure 10.4 panels: (a) Row to Estimate; (b) Extent of Model; (c) Augmented State]



Fig. 10.4. The Reduced Update principle: Suppose we wish to produce estimates for a given 
row (a), based on some local half-plane model (b). However, under a marching scheme, even to 
estimate just that single scalar would require augmenting the row-state with multiple previous 
rows (c) to preserve the necessary history, a significant computational complexity. The RUKF 
performs the state update only within the model support in (b). 



although there may be other factors (density of measurements, appearance of esti-
mation artifacts, spatial correlation length) which constrain the appropriate range
of w and \Delta.
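
The optimum in (10.32) can be checked by brute force. The sketch below evaluates the cost expression (10.31) for every integer number of strips; the row length and overlap are arbitrary illustrative values.

```python
def strip_cost(n, q, overlap):
    """Per-row cost of the strip filter, up to a constant: q * (n/q + overlap)^3."""
    return q * (n / q + overlap) ** 3

n, overlap = 512, 16                       # hypothetical row length and overlap
costs = {q: strip_cost(n, q, overlap) for q in range(1, n + 1)}
q_best = min(costs, key=costs.get)         # brute-force minimizer
q_formula = 2 * n / overlap                # (10.32)
saving = strip_cost(n, 1, 0) / costs[q_best]
```

Setting the derivative of (10.31) to zero gives (n/q + \Delta) = 3n/q, that is \Delta = 2n/q, which is where the closed form comes from; the brute-force search lands on the same value.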



10.2.4 Reduced-Update KF 

The one disadvantage of the strip-based Kalman filter is that by processing the strips 
independently, if the measurements are sufficiently sparse the boundaries of the strips 
will necessarily appear as artifacts in the estimates. Instead, the Reduced Update 
Kalman filter [188,337,338] produces state estimates by scanning along the state 
vector, one scalar element at a time. 

The principle of the RUKF is illustrated in Figure 10.4. Suppose that each image row
contains N state elements, and that the 2D auto-regressive model in Figure 10.4(b)
has a size of q x q pixels. In terms of implementing a scalar, raster-scanning Kalman
filter:

• The prediction step is based only on the size of the model in Figure 10.4(b),
involving O(q^2) state elements;

• The update step, on the other hand, needs to update all of the elements in the
augmented state in Figure 10.4(c), involving O(qN) state elements.



Since we expect that N is somewhat larger than q, and because computational com-
plexity in the Kalman filter goes as the cube of the number of state elements, it is
clear from the above two points that the prediction step is relatively straightforward,
and that efforts towards simplification should be concentrated on the update step.
The goal of the RUKF is to perform the update step, but involving only those O(q^2)
state elements in the model. Because these q^2 elements are those most likely to be
meaningfully related to the measurement being processed, the degree of approxima-
tion is modest.

Adapting the notation from [188], we can summarize the RUKF in standard Kalman
filter form as follows. Suppose we are given a stationary, causal Markov model as in
(6.38):

    z_t = \sum_{s \in \mathcal{N}} g_s z_{t-s} + w_t,    (10.33)

where \mathcal{N} is a symmetric half-plane neighbourhood, t is a state index (here two-
dimensional), and w is a driving process noise with stationary variance \sigma^2. Then the
RUKF becomes

Prediction:  \hat{z}(t|t-1) = \sum_s g_s \hat{z}(t-s|t-1)    (10.34)

             P(t,i|t-1) = \sum_s g_s P(t-s,i|t-1),    i < t    (10.35)

             P(t,t|t-1) = \sum_s g_s P(t,t-s|t-1) + \sigma^2    (10.36)

Update:      \hat{z}(i|t) = \hat{z}(i|t-1) + k(t-i|t) (m(t) - \hat{z}(t|t-1))    (10.37)

             P(i,j|t) = P(i,j|t-1) - k(t-i|t) P(t,j|t-1)    (10.38)

             k(t-i|t) = P(t,i|t-1) (P(t,t|t-1) + r(t))^{-1}.    (10.39)

The structural similarity of the above RUKF to that of the standard Kalman filter in 
Chapter 4 is clear. 

Although a causal model is used, it is important to understand that the produced
estimates are not causal: a given measurement m(t) is updated anti-causally into the
neighbourhood \mathcal{N} preceding t, and will be predicted causally into the state elements
following t. Of course the degree of acausality is limited by the size of \mathcal{N}.
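
The reduced update itself can be sketched as follows. This is a simplified illustration rather than the RUKF of [188]: the state is stored as an ordinary vector with a dense covariance, the measurement is assumed to observe element t directly with noise variance r, and the function name and `support` argument are hypothetical.

```python
import numpy as np

def reduced_update(z, P, t, m, r, support):
    """Scalar-measurement update of element t, applied only to the state
    elements listed in `support` (which should include t)."""
    z = np.array(z, dtype=float)
    P = np.array(P, dtype=float)
    idx = np.asarray(list(support))
    innovation = m - z[t]
    row_t = P[t, idx].copy()
    k = row_t / (P[t, t] + r)                    # scalar gains, one per element
    z[idx] += k * innovation                     # estimate update
    P[np.ix_(idx, idx)] -= np.outer(k, row_t)    # covariance update within support
    return z, P

# Arbitrary 5-element test problem.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
P0 = M @ M.T + np.eye(5)
z0 = rng.standard_normal(5)
z_full, P_full = reduced_update(z0, P0, t=3, m=0.7, r=0.5, support=range(5))
z_loc, P_loc = reduced_update(z0, P0, t=3, m=0.7, r=0.5, support=[2, 3])
```

With the support set to the whole state this reduces to the standard Kalman update; with a small support, elements outside it are left untouched, which is the approximation being made.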

10.2.5 Sparse KF 

Rather than breaking the problem into pieces (strip KF) or localizing the state 
(RUKF), one alternative is to take advantage of problem sparsity (Section 5.3): 

• Reduced storage complexity 

• Reduced matrix-matrix multiplication complexity 

• Possibly, reduced matrix inversion complexity

Of the matrices which appear in the Kalman filter, it is common for C, B, R to be
sparse. Our focus is therefore on the dynamics matrix A and the error covariances
P(t|t), P(t|t-1).

Dynamics: If the time-discretization is chosen sufficiently fine, then for most physi-
cal systems A(t) will be sparse and nearly diagonal since a given state element z_i
is, in a brief instant of time, likely to influence only those few other elements in its
immediate vicinity.

Covariances: Unless the dynamics A are diagonal, over time all of the elements of
z become correlated and P(t|t), P(t|t-1) become full. Sparsity is therefore a
matter of assertion or approximation.

The estimation errors do tend to be locally correlated; however, the correlation
decays slowly with offset, and the covariances P(t|t), P(t|t-1) are not naturally
sparse. In many cases it may be possible to assume a Markov model for the errors,
however, meaning that it is P^{-1}(t|t) for which a sparse form may be a reasonable
approximation.

A variety of sparse Kalman filters has been proposed [10, 65, 66, 307], most of 
them [65, 66, 307] based on sparsifying the inverse covariance, therefore actually 
implementing a sparse version of the Information Kalman filter (Section 4.3.1). We 
need to specify two things: 

1. A sparsification operation, normally banded-Markov,

2. A sparse-matrix inversion operation.

Under the assumption that the covariances satisfy diagonal dominance, the series 
approximation to matrix inversion (5.21) can be used to efficiently implement the 
latter matrix inversion. 

The former sparsification operator seems straightforward in principle: for informa- 
tion (inverse) covariances, set to zero all elements outside of some set of bands. In 
practice, the need to preserve positive-definiteness makes the operation more subtle. 
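
The subtlety is easy to demonstrate: naive band truncation can destroy positive-definiteness. The sketch below uses a small positive-definite test matrix chosen only for illustration (any positive-definite matrix may play the role of an information matrix); the helper name `band_truncate` is hypothetical.

```python
import numpy as np

def band_truncate(Q, b):
    """Zero all elements of Q more than b diagonals away from the main diagonal."""
    i, j = np.indices(Q.shape)
    return np.where(np.abs(i - j) <= b, Q, 0.0)

idx = np.arange(3)
Q = 0.8 ** np.abs(idx[:, None] - idx[None, :])   # positive-definite 3 x 3 matrix
Q_banded = band_truncate(Q, 1)                   # keep only the tridiagonal part

min_eig_full = np.linalg.eigvalsh(Q).min()
min_eig_banded = np.linalg.eigvalsh(Q_banded).min()
```

Here the tridiagonal truncation has eigenvalues 1 and 1 +/- 0.8*sqrt(2), so its smallest eigenvalue is negative even though the original matrix is positive-definite.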



10.2.6 Reduced-Order KF 

Because much of remote sensing is inherently global and time-dynamic in nature, 
such as the discussion in Application 4 on page 122, there has been a particularly 
large number of Kalman filters implemented for large-scale time-dynamic data as- 
similation, such as in [103, 121, 191] and in citations [2-12] within [111]. 

The Reduced-Order Kalman filter [7, 103, 239, 287] seeks a reduction of basis, as
discussed in Section 8.2.2 and as illustrated in Application 8 on page 285, in order
to reduce the dimensionality of the state to an extent that a direct implementation of
the Kalman filter for the reduced state is feasible.

From (8.117), suppose that some operator A characterizes the discrete-time dynam-
ics of a random process z':

    \hat{z}'(t+1|t) = A(\hat{z}'(t|t)).    (10.40)

In many cases, the dynamic problem involves fluctuations about a mean (Exam-
ple 3.3); for example, the ocean sea-surface temperature exhibits modest variations
about a relatively complex mean. If we subtract out the possibly time-varying mean

    z(t) = z'(t) - \bar{z}(t)    (10.41)

then z is our process of interest. We can define a reduced state

    \tilde{z}(t) = F z(t),    (10.42)

where F is not obligated to preserve any (possibly sharp and complex) details of z',
only the typically smoother deviations z about the mean. The degree of reduction,
the number of rows in F, is chosen such that a regular, dense Kalman filter can be
applied to \tilde{z}.

Given updated estimates, the estimates of z' are found by inverting (10.41), (10.42):

    \hat{z}'(t|t) = \bar{z}(t) + S \hat{\tilde{z}}(t|t),    (10.43)

where S is the pseudoinverse of F, as in Section 8.2.2.

The high resolution estimates are passed through dynamics A, meaning that the pre-
diction step takes place in the unreduced domain:

    \hat{\tilde{z}}(t+1|t) = F [ A(\bar{z}(t) + S \hat{\tilde{z}}(t|t)) - \bar{z}(t+1) ].    (10.44)

In the event that A is linear, then the deterministic and stochastic portions of the
problem decouple (the Wold decomposition [248]), implying that the Kalman filter
can operate mean-removed, as usual, and allowing the predict step to take place in
the reduced domain:

    \hat{\tilde{z}}(t+1|t) = \underbrace{F A S}_{\tilde{A}} \hat{\tilde{z}}(t|t).    (10.45)
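
For the linear case, the reduced prediction of (10.45) amounts to forming the small matrix F A S once. The sketch below assumes a pairwise-averaging F and a tridiagonal A, both chosen purely for illustration; all variable names are hypothetical.

```python
import numpy as np

n, k = 8, 4
# Hypothetical reduction operator: average neighbouring pairs of elements (k x n).
F = np.kron(np.eye(k), np.full((1, 2), 0.5))
S = np.linalg.pinv(F)                      # pseudoinverse (n x k), replicates each value
# Hypothetical linear dynamics on the full state.
A = 0.6 * np.eye(n) + 0.2 * np.eye(n, k=1) + 0.2 * np.eye(n, k=-1)

A_red = F @ A @ S                          # reduced dynamics, k x k
z_red = np.array([1.0, 2.0, 0.5, -1.0])    # reduced state estimate
z_red_next = A_red @ z_red                 # predict entirely in the reduced domain
```

The dense Kalman filter then operates on k rather than n elements; how much is lost depends on how well the range of S captures the deviations about the mean.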



10.3 Multiscale 



All of the other approaches in this chapter have synthesized a dynamic problem by
cutting a static problem into spatial pieces, such that the "time" variable t of the
Kalman filter essentially indexes the rows, columns, or planes of a spatial problem.

Fig. 10.5. A dyadic representation of a one-dimensional process z: the coarse representation
x_0 is split, such that x_1 is broken into two decoupled parts x_{11}, x_{12}. The repeated splitting
continues for J scales; the goal is to find a tree model such that the finest scale x_J possesses
the statistics of the represented process; that is, such that cov(x_J) = cov(z).



There are, to be sure, other ways of cutting up a spatial problem (see Problem 10.1, 
for example) than just rows or columns. As a significant departure from the rest of 
this chapter, we here consider allowing the Kalman filter "time" to index scale, rather 
than space. The basic idea, then, is to design a Kalman filter and RTS smoother to 
solve a static problem by iterating back and forth in scale. Intuitively this is vaguely 
analogous to the multigrid method of Section 9.2.5, except that multigrid is an iter- 
ative method, taking repeated passes at an estimation problem, whereas the Kalman 
filter / RTS smoother involve a single forwards and single backwards pass. 

We know from Section 8.5 that subsampling or a change of resolution destroys the 
Markovianity of a random field, so the reader may wonder what a multiscale Kalman 
filter has to offer. It is pertinent, at this point, to clarify a distinction between mul- 
tiresolution and multiscale: 

A Multiresolution model involves representing a random field with state ele-
ments that have varying spatial extent, such that coarse-scale pixels have a larger
region of support than fine-scale ones.

A Multiscale model represents a random field on a hierarchy, but the nature of
the representation at coarse scales does not necessarily involve low-resolution el-
ements.



Therefore certain phenomena in remote sensing, such as 1/f power-laws [281]
which involve structures at multiple scales, may be better served with a multireso-
lution approach, whereas textures and single-scale random fields which are spatially
Markov may be more appropriately represented via a multiscale model. The mul-
tiscale statistical model [69, 105, 169, 214] described in this section includes both
Markov/multiscale and multiresolution models, although we focus on the GMRF
case.

Suppose we are given the usual static problem 

    z_static \sim P_static,    m = C z_static + v,    v \sim R.    (10.46)

To construct an efficient multiscale model for z, we seek to define a dynamic model

    x_{t+1} = A_t x_t + B_t w_t,    x_0 \sim P_0,    w_t \sim I    (10.47)

such that the dynamic multiscale state x evolves to z after a finite number of steps J.
That is, we need to find A_t, B_t such that the finest scale of the multiscale problem
equals or approximates the given static problem, such that

    P_static \approx P_J = A_J P_{J-1} A_J^T + B_J B_J^T    (10.48)

                         = A_J [ A_{J-1} P_{J-2} A_{J-1}^T + B_{J-1} B_{J-1}^T ] A_J^T + B_J B_J^T    (10.49)

and that, ideally, A_t and B_t have some particular form which makes estimation par-
ticularly easy and efficient.

Since the original static problem is equivalent to the finest scale, this is the only
time-step at which the dynamic problem has measurements:

    m_J = C x_J + v_J,    v_J \sim R.    (10.50)



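Figure 10.5 illustrates the key idea to making this model efficient. The state is re-
peatedly divided into decorrelated pieces, such that x_{22} can depend on x_{11}, but not
on x_{12}. This decoupling or decorrelation places a restriction on the dynamic model,
such that A, B take the form

    x_1 = \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix} = A_1 x_0 + B_1 w_1
        = \begin{bmatrix} A_{11} \\ A_{12} \end{bmatrix} x_0 + \begin{bmatrix} B_{11} & \\ & B_{12} \end{bmatrix} w_1    (10.51)

    x_2 = \begin{bmatrix} x_{21} \\ x_{22} \\ x_{23} \\ x_{24} \end{bmatrix} = A_2 x_1 + B_2 w_2
        = \begin{bmatrix} A_{21} & \\ A_{22} & \\ & A_{23} \\ & A_{24} \end{bmatrix} x_1
          + \begin{bmatrix} B_{21} & & & \\ & B_{22} & & \\ & & B_{23} & \\ & & & B_{24} \end{bmatrix} w_2    (10.52)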

Since the process noise w terms are white, the above large dynamic equations can be
broken down into smaller pieces; for example,

[Figure 10.6 panels: Multiscale State Estimates; Multiscale Error Variances]



Fig. 10.6. Multiscale estimates and error variances [105]. The estimates are computed from 
the satellite altimetry data of Figure 1.3, where the higher accuracy of the estimates near the 
measured paths can be seen very clearly in the pattern of the error variances. 



    x_{11} = A_{11} x_0 + B_{11} w_{11}
    x_{21} = A_{21} x_{11} + B_{21} w_{21}
      ...
    x_{24} = A_{24} x_{12} + B_{24} w_{24}.    (10.53)



Rather than the cluttered indices present in (10.53), we can generalize (10.53) in
terms of a single index s,

    x(s) = A(s) x(\uparrow s) + B(s) w(s),    (10.54)

where \uparrow s is the index of the parent of s. This generalized structure applies to one-
dimensional dyadic trees, two-dimensional quad-trees, and indeed to any tree struc-
ture in any number of dimensions.

If we define each state to be some linear function of the underlying random field: 

    x(s) = S(s) z    (10.55)

then the statistics at the finest scale allow us to infer the multiscale statistics,

    z \sim P_static  \Rightarrow  P(s) = cov(x(s)) = S(s) P_static S^T(s),    (10.56)

and from those the model parameters in (10.54):

    A(s) = E[x(s) x^T(\uparrow s)] P^{-1}(\uparrow s)    (10.57)

    B(s) B^T(s) = P(s) - A(s) P(\uparrow s) A^T(s).    (10.58)
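
The relations (10.56) to (10.58) can be sketched numerically for a single parent-child pair. The selection matrices, the exponential test covariance, and the function name below are illustrative assumptions; the cross-covariance is computed as S(s) P_static S^T of the parent, following from (10.55).

```python
import numpy as np

def multiscale_params(P_static, S_child, S_parent):
    """Return A(s), B(s)B^T(s), and P(s) for a child state S_child z
    whose parent state is S_parent z."""
    P_c = S_child @ P_static @ S_child.T        # (10.56) for the child
    P_p = S_parent @ P_static @ S_parent.T      # (10.56) for the parent
    P_cp = S_child @ P_static @ S_parent.T      # child-parent cross-covariance
    A = np.linalg.solve(P_p, P_cp.T).T          # A = P_cp P_p^{-1}
    BBt = P_c - A @ P_p @ A.T                   # (10.58)
    return A, BBt, P_c

# Hypothetical 1D Markov prior on seven points.
idx = np.arange(7)
P_static = 0.8 ** np.abs(idx[:, None] - idx[None, :])
I7 = np.eye(7)
S_parent = I7[[3]]          # parent state: the middle pixel
S_child = I7[[0, 1, 2]]     # child state: the left half
A_s, BBt_s, P_s = multiscale_params(P_static, S_child, S_parent)
```

The computation is the same as for marching, except that the two pieces are a parent and a child on the tree rather than neighbouring columns.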



It is not a coincidence that these equations appear very similar to the marching dy-
namics of (10.9), (10.13). The multiscale estimator is essentially a distributed march-
ing algorithm, marching over the scales of a tree, rather than across space.

[Figure 10.7 panels: First / Second-order Markov; Third / Fourth-order Markov]



Fig. 10.7. If the underlying prior model is Markov, then keeping the pixellated values is suffi- 
cient to conditionally decorrelate the process into four quadrants. If each of those quadrants is 
further broken down into quadrants, we get a quad-tree whose finest scale statistics precisely 
equal the original Markov model. The number of rows or columns to keep in the state is a 
function of the Markov order of the underlying prior. 



The multiscale estimation algorithm [69, 105, 169, 214] is essentially like the Kalman
smoother on the tree structure, with two exceptions:

1. In order for any measurement at any tree index to be able to influence the estimate
at any other tree index, it is necessary to run the Kalman smoother in reverse, first
filtering from fine-to-coarse, and then smoothing from coarse-to-fine.

2. In the tree a given node can have multiple child nodes, thus some sort of merge 
step is required to combine the information from multiple children in computing 
the estimate at the parent. 

The resulting estimator performs an upwards pass, from fine to coarse, producing
estimates \hat{x}_u(s) with uncertainty P_u(s), followed by a downwards pass (the RTS
smoother), producing estimates \hat{x}(s) with covariance P(s). An example of multi-
scale estimates and estimation error variances is shown in Figure 10.6.

If x(s) \in R^{n(s)}, then the complexities of the multiscale approach are

    O( \sum_s n^2(s) ) storage,    O( \sum_s n^3(s) ) computation.    (10.59)

The remaining question is how to effectively select the state x(s) = S(s)z in order 
to satisfy the two multiscale objectives: 



1. n(s) is small at each tree index, for computational efficiency; 

2. The finest-scale multiscale statistics P_J equal the given statistics P_static.

Fig. 10.8. Three levels of a multiscale hierarchy, with states chosen appropriate for a first-
order Markov field, as in Figure 10.7. The purpose of each state is to hold that information
which allows its four quadrants to be conditionally decorrelated.



Suppose we are given a two-dimensional static problem with a first-order Markov
prior. Then if S(0) selects all of the pixels in the middle row and column, as illus-
trated in Figure 10.7, then conditioned on the root state x(0) the four quadrants are
conditionally decorrelated, precisely the decoupling required in (10.51), and philo-
sophically clearly related to the nested dissection of Section 9.1.3.
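
The conditional decorrelation is easiest to verify in one dimension, where the "boundary" is a single pixel. The sketch below assumes a first-order Markov (exponential covariance) sequence, chosen for illustration, and computes the covariance between the two halves conditioned on the middle element.

```python
import numpy as np

n, rho = 9, 0.7
idx = np.arange(n)
P = rho ** np.abs(idx[:, None] - idx[None, :])   # 1D first-order Markov prior

mid = [4]                 # the separating "boundary" pixel (the root state)
left = [0, 1, 2, 3]
right = [5, 6, 7, 8]

P_lr = P[np.ix_(left, right)]
P_lm = P[np.ix_(left, mid)]
P_mm = P[np.ix_(mid, mid)]
P_mr = P[np.ix_(mid, right)]

# Cross-covariance of the two halves, conditioned on the middle pixel.
P_lr_given_mid = P_lr - P_lm @ np.linalg.solve(P_mm, P_mr)
```

Unconditionally the two halves are clearly correlated, but given the middle pixel their cross-covariance vanishes, which is the one-dimensional analogue of the quadrant decoupling above.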

There is, however, no reason to content ourselves with limiting the decomposition of 
the field into four quadrants. We can proceed further, creating boundaries similar to 
those in Figure 10.7 within each of the quadrants, thus we can continue the successive 
subdivision of the field into smaller pieces, as illustrated in Figure 10.8. 

The computational complexity is dominated by the large state at the tree root, with 
the complexity geometrically decreasing with scale. The computational complexity 
is shown as a function of dimension in Table 10.1. 

A variety of generalizations may be considered: 



• The point measurements of the original static problem are normally associated
with the individual pixels at the finest level of the tree, however with the appro-
priate definition of x(s) at coarser levels of the tree, nonlocal measurements can
also be accommodated.

Dimensions   Problem Size   # Pixels   Multiscale Complexity

1D           n              N = n      O(n)   = O(N)
2D           n x n          N = n^2    O(n^3) = O(N^{1.5})
3D           n x n x n      N = n^3    O(n^6) = O(N^2)



Table 10.1. Computational complexity of the multiscale method, as compared to a direct 
solver, as a function of dimensionality. 




Fig. 10.9. The boundaries of Figure 10.7 may be much more densely sampled than needed. It 
may be a very reasonable approximation to subsample (left) or average (right) along the quad- 
rant boundaries. The four quadrants will no longer be perfectly decorrelated, although very 
nearly so. Sample results, based on a subsampled state, are shown in Figures 10.10 and 10.11.



• It is possible to benefit from further computational efficiency via approximations. 
In particular, if the random field has a long correlation length, then it may be 
highly redundant (and poorly conditioned) to preserve every pixel along the quad- 
rant boundaries. Instead, it may be very reasonable to select the state by subsam- 
pling or averaging [213, 232], as illustrated in Figure 10.9. 

• The overlapped approach of Section 8.2.3 lends itself particularly well to the mul- 
tiscale environment [169], especially when the states are approximated, rather 
than exact. With approximate states the quadrant decorrelation is imperfect, resulting
in estimation inconsistencies and artifacts along the quadrant boundaries;
by overlapping, these artifacts can be reduced, as illustrated in Figure 10.11.

• Because the update step of the Kalman filter is essentially a static estimate, it is 
possible to use the multiscale estimator to solve the update step of large Kalman 
filters [191]. Indeed, the update step in the dynamic estimator in Application 4 
was based on the multiscale method of this section. 



It is possible to calculate both prior and posterior samples from the multiscale model, 
which is discussed in Section 11.2.3. 
[Figure 10.10: horizontal axis "Longitude East"]

Fig. 10.10. A comprehensive, multidimensional estimation example [232]. Ship-based ocean 
temperature measurements are taken, as was illustrated in Figure 1.3 on page 5. The mea- 
surements are densely sampled in depth and the problem is essentially statistically stationary 
in latitude and longitude, therefore the problem was decoupled into multiple depth slices, as 
discussed in Example 8.1 on page 252. For each depth slice, a first-order tree can be created, 
as sketched in Figure 10.8. To reduce computational complexity and improve conditioning, a 
subsampled state is used, as in Figure 10.9. 




[Figure 10.11 panels: "Highly reduced-order model, with apparent texture artifacts"; "Even further reduced order, but overlapped"]



Fig. 10.11. Multiscale estimation of random-field textures with reduced-order models, with 
states as in Figure 10.9. The effectiveness of an overlapped approach, as discussed in Sec- 
tion 8.2.3, in reducing quadrant-boundary artifacts is clear. 
Application 10: Video Denoising [178]

We wish to denoise video. Since successive images in a video are highly related, a
temporal filter seems reasonable, so we wish to implement something like a Kalman
filter. However, the images are large and we want real-time filtering, so a standard
Kalman filter in the spatial domain is not practical.

Video denoising approaches in the spatial domain [9, 41, 192] can be divided into
three classes:

Temporal-only: An approach utilizing only the temporal correlations, neglecting
spatial information.

Spatial-only: Apply 2D spatial denoising to each video frame, taking advantage
of the vast image denoising literature, but ignoring the temporal correlations.

Spatio-temporal: More sophisticated methods exploiting both spatial and tem- 
poral correlations, such as simple adaptive weighted local averaging, 3D Kalman 
filtering, and 3D Markov models. 

As an alternative to spatial-domain processing, we consider wavelet-based approaches, 
which have led to impressive results in 2D denoising. It would seem natural to select 
3D wavelets for video denoising, however there are a number of drawbacks: 

1. There is a clear asymmetry between space and time; for effective denoising we 
need to treat the space and time axes distinctly, not as a large 3D cube of data. 

2. All of the image frames need to be in place in order to apply the 3D wavelet, 
therefore there is a long latency time between acquiring and denoising an image. 

3. 3D wavelets cannot be sensitive to all of the possible object motions. 

We can address these drawbacks by applying the 2D wavelet transform to each 2D 
image frame, and then performing spatio-temporal video filtering in the wavelet do- 
main. That is, essentially we want to develop a large Kalman filter in the wavelet 
domain. 

Next, we wish the temporal dynamics to correspond to motion. For most wavelets,
image motion does not imply a motion of coefficients in the wavelet domain, so we
need to choose a shift-invariant, overcomplete wavelet transform [222]. The benefits
of such an approach are clear:

1. The recursive, frame-by-frame approach implies low latency; 

2. The wavelet decorrelative property allows very simple, fast, scalar temporal fil- 
tering; 
[Figure 10.12 panel title: "Motion for Prediction Step"]



Fig. 10.12. A frame from the Paris video, showing the inferred motion which is used as the 
time-prediction step in the wavelet domain. 



3. Where motion estimates are unreliable, spatial (non-temporal) methods can pro- 
vide denoising. 



Given a noisy image sequence 

$$m(t) = z(t) + v(t), \tag{10.60}$$

we transform it into the wavelet domain

$$\bar{m}(t) = W m(t) = W z(t) + W v(t) = \bar{z}(t) + \bar{v}(t), \tag{10.61}$$

where the explicit transformation of the problem is possible because each image is 
densely sampled. We assert an autoregressive form for the signal model to fit the 
Kalman filter Gauss-Markov dynamics: 



$$z(t+1) = A(t)\, z(t) + B(t)\, w(t) \tag{10.62}$$



for some white, stochastic driving process w. The inference of A and B is simpli- 
fied here by assuming that each frame is related to its predecessor, subject to some 
displacement field D(t). Since the selected wavelet is shift-invariant, the wavelet co- 
efficients are subject to the same motion as the image itself, such as those illustrated 
in Figure 10.12, thus the dynamic model simplifies as 



$$z_i(t) = z_{i+D}(t-1) + 0 \cdot w(t). \tag{10.63}$$



This model captures only translations, and not occlusion or zooming. One can choose 
to test model (10.63) by hypothesis testing, and where the model is invalid (that is, 
[Figure 10.13 panels: "Original Video Frame"; "Image Frame with Added Noise"; "Wavelet-Kalman Denoised"; "Difference between Original and Denoised"]



Fig. 10.13. Applying the spatio-temporal wavelet denoising method of [178]: Wavelet artifacts 
are not apparent in the reconstruction and the error image is small, with great noise reduction 
in those parts of the images, away from edges, which can be reliably predicted over time. 



where Z(t - 1) and Z(t) cannot be matched by translation), then the null model can
be asserted:

$$z_i(t) = 0 \cdot z_i(t-1) + B(t) \cdot w(t). \tag{10.64}$$

This model has no dynamics, and is thus a purely spatial problem, to which standard 
image denoising methods (Appendix C) can be applied. 

The above approach was developed, implemented, and applied to video [178], with 
results as shown in Figure 10.13. For a relatively simple idea — using a Kalman 
filter in the wavelet domain with the prediction step based on motion estimation — 
the results are quite compelling. 
Fig. 10.14. An overview and visual comparison of estimation methods. The state (black dots)
is written in terms of a model over some portion of the domain (shaded), however the estimated 
state may be influenced by measurements within some larger domain (hatched). 



Summary 



Figure 10.14 visually compares twelve methods of estimation. The models differ 
primarily on the basis of locality, causality, and order / complexity. 

Clearly this list is not exhaustive, and should be understood to be complementary to 
other methods and alternatives already explored in this text: 

• The Kalman filter algorithmic alternatives in Section 4.3, 

• The methods of transformation and dimensionality reduction in Figure 5.2 at the 
start of Chapter 5, 



• The methods of representation in Figure 6.13 at the end of Chapter 6. 
For Further Study



The reader may find it interesting to follow the evolution of the Kalman filter, for 
spatial processing purposes, from the development of the original filter in 1960 [182], 
the Kalman smoother in 1965 [266], the two-dimensional strip filter in 1977 [337], 
the reduced update Kalman filter in the early 1980s [188,338], the reduced order 
Kalman filter in 1989 [7, 287], and more recent work in three-dimensional and video 
filtering [192]. 



Sample Problems 



Problem 10.1: Other Approaches to Marching 

The marching method of Section 10.1 describes the solution of a two-dimensional
problem by marching column-by-column. However, there is nothing special 
about breaking an image into columns. 

Describe how you could perform 2D static estimation by dynamically marching 
diagonal by diagonal. 

Problem 10.2: Marching Limitations 

There are a number of limitations to the first-order, causal marching method. For 
each of the following limitations, briefly discuss or suggest an alternative: 

(a) The computed estimates are causal, depending only on measurements in the 
current and previous columns. 

(b) The first-order dynamic model is a poor approximation of the underlying 
prior statistics. 

(c) The computational complexity is too high when each column has very many 
elements. 

(d) The marching method assumes Markovianity; what do I do for a stationary 
non-Markov problem? 

Problem 10.3: 2D Marching 

Let $Z = [z_1 \ldots z_{64}]$ be a 64 x 64 two-dimensional zero-mean stationary process
with periodic boundary conditions.

Let Q be the "Tree-Bark" kernel from Example 6.1 on page 190. We consider
two possible prior models for Z:

$$Q_1 = Q \quad \text{and} \quad Q_2 = (Q)^T,$$

where $(Q)^T$ is just the transpose of the kernel itself. Using the FFT method we
can invert the model kernel $Q_1$ to find the covariances

$$P_{\text{self}} = \operatorname{cov}(z_i) \qquad P_{\text{cross}} = E\left[ z_i z_{i+1}^T \right]$$

Note that because of the stationarity of Z, $P_{\text{self}}$ and $P_{\text{cross}}$ are not a function of i.

We will be marching column by column. From $P_{\text{self}}$ and $P_{\text{cross}}$ determine $A$, $BB^T$,
the matrices in the first-order dynamic model. Initialize the Kalman filter with

$$\hat{z}(1|0) = 0 \qquad P(1|0) = P_{\text{self}}.$$

Use this Kalman filter to answer the following: 

(a) Suppose we observe only a single pixel 

$$10.0 = m = Z(20,20) + v \qquad v \sim \sigma^2 = P_{\text{self}}(1,1).$$

Use the Kalman filter to compute estimates and error variances; plot and 
interpret the results. 

(b) Could the FFT method have been used to compute the estimates and error 
variances in part (a)? 

(c) Now suppose we observe every pixel. Use the FFT method, based on prior
$Q_1$, to create a random sample image M, which we will use as our observations:

$$M = Z + V \qquad [V]_: \sim R, \quad R_{ij} = \delta_{ij}\, P_{\text{self}}(1,1).$$

Use the Kalman filter to compute estimates and error variances; plot and 
interpret the results. 

(d) Could the FFT method have been used to compute the estimates and error 
variances in part (c)? 

(e) Now repeat the entire problem, through part (d), but using prior kernel $Q_2$
rather than $Q_1$. Plot the two sets of estimates and error variances.

(f) How are the results from $Q_1$ different from those of $Q_2$? Pay close attention
to the error variance plots.

Problem 10.4: Marching Variations

The implementation in Problem 10.3 used the standard Kalman filter. For a suf- 
ficiently large 2D or 3D domain, the regular Kalman filter might have computa- 
tional limitations, and very likely issues with numerical stability. Repeat Prob- 
lem 10.3(a), but instead using 

(a) The square root KF of Section 4.3.1 

(b) The Strip KF of Section 10.2.3 

(c) The RUKF of Section 10.2.4 

Problem 10.5: Acausal Marching 

The implementation in Problem 10.3 used only the Kalman filter in computing 
estimates, which are therefore causal, whereas we know that an acausal approach, 
Kalman smoothing, can yield superior results. Repeat Problem 10.3(a), but using 
the RTS Kalman smoother, as described in Section 4.3.3. 

Discuss the differences you see between the causal and acausal estimates and 
error variances. 

Problem 10.6: Open-Ended Real-Data Problem — Large-Scale Kalman Filtering 

Implement a Kalman filter for a large-scale, dynamic dataset. Many remote sens- 
ing satellites (ATSR, Topex, ERS) have data freely available online. 

Avoid datasets having a complex forward problem, such as the radiometric one 
discussed in Application 3. Instead, we seek direct measurements, such as of 
ocean height (Topex) or sea-surface temperature (ATSR).

The two key challenges, as were discussed in Application 4, are that we have 
sparse measurements of a changing field: 

• If the measurements were dense, then we would only need some sort of de- 
noising or interpolation. 

• If the underlying field were not changing, we would just have a spatial static 
problem with many measurements. 

Develop an iterative, Kalman-filter like approach to producing dynamic estimates 
of the underlying random field, based on the satellite data. 



11 

Sampling and Monte Carlo Methods 



The matter of statistical sampling was discussed in Chapter 2: Prior Sampling in 
Section 2.5.2, and Posterior Sampling in Section 2.5.4. 

Given a random variable z obeying some prior probability density function p(z),
sampling from the prior distribution means generating independent random samples
$z_1, z_2, \ldots$ from p(z):

$$z_i \sim p(z), \tag{11.1}$$

and similarly posterior sampling from a distribution conditioned on measurements:

$$(z_i|m) \sim p(z|m). \tag{11.2}$$

The key is to decompose $(z|m)$ into deterministic and stochastic components,
essentially the Wold decomposition [248]:

$$(z|m) = (\hat{z}|m) + (\tilde{z}|m). \tag{11.3}$$

That is, the posterior z equals the estimate plus a random sample obeying the
estimation error statistics, as illustrated in Figure 11.1. Much of this text has looked at
methods for generating estimates $\hat{z}$; the key to posterior sampling, then, is a means
of sampling the error process $\tilde{z}$.

The first part of this chapter develops algorithms for continuous-state prior and
posterior sampling, based on the Kalman filter, marching methods, and multiscale
methods, paralleling the sequence of methods presented in Chapter 10.

The substantial last part of this chapter, in Section 11.3, develops Monte Carlo and
discrete-state methods. These are of key importance for discrete-state fields, in
particular those that form part of a hidden Markov model in Chapter 7.

The goal of Monte Carlo samplers is to find a sequence of states, dependent only on
a Gibbs energy H, such that the state sequence converges to a random sample of the
probability density implied by H:
[Figure 11.1 panels: "Truth"; "Estimates" ($\hat{z}|m$); "Sampled Error" ($\tilde{z}|m$); "Posterior Sample" ($z|m$)]


Fig. 11.1. The process of posterior sampling. The top two panels show a sample from an 
anisotropic prior model and estimates based on the central measured columns. The bottom-left 
panel shows the sampled estimation error, where a low-variance zero-mean band can be seen 
around the measurements, where the estimation uncertainties are small. The final panel shows 
the sampled posterior, consistent with both the measurements and the prior statistics. The 
estimates and sampled error were generated using the multiscale approach of Sections 10.3 
and 11.2.3. 



$$H \;\longrightarrow\; z_1, z_2, \ldots \quad \text{such that} \quad \lim_{i \to \infty} z_i \sim \frac{\exp(-\beta H)}{\mathcal{Z}}. \tag{11.4}$$



As we saw with Gibbs fields in Chapter 6 and in the context of hidden models in 
Chapter 7, the Gibbs model makes no particular distinction between prior and pos- 
terior models. The distinction is, in fact, only a matter of whether the measurements 
appear (posterior) or not (prior) in the energy function H. 
11.1 Dynamic Sampling

In the context of a linear, Gauss-Markov dynamic process (4.1), the task of prior
sampling is straightforward, just a simulation of the evolution of the dynamic process
over time. The process initialization

$$z(0) \sim \mathcal{N}(0, P_0) \tag{11.5}$$

requires sampling z(0) from the prior model $P_0$, based on the matrix square root of
$P_0$, as discussed in Section 2.5.2 and Appendix A.8. With the recursion initialized,
dynamic prior sampling proceeds as

$$z(t+1) = A(t)\, z(t) + B(t)\, w(t), \tag{11.6}$$

where w(t) is a zero-mean, unit-variance Gaussian random vector.
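
The prior recursion (11.5)-(11.6) can be sketched in a few lines. This is a minimal illustration only: the function name `sample_prior_path`, the time-invariant A and B, and the example matrices are assumptions made for brevity, not part of the text.

```python
import numpy as np

def sample_prior_path(A, B, P0, T, rng):
    """Simulate z(0..T) from z(t+1) = A z(t) + B w(t), with z(0) ~ N(0, P0).

    A and B are taken as time-invariant here purely for brevity.
    """
    n = A.shape[0]
    # Matrix square root of P0 via Cholesky: P0 = G @ G.T
    G = np.linalg.cholesky(P0)
    z = np.zeros((T + 1, n))
    z[0] = G @ rng.standard_normal(n)
    for t in range(T):
        w = rng.standard_normal(B.shape[1])   # zero-mean, unit-variance
        z[t + 1] = A @ z[t] + B @ w
    return z

# Illustrative (assumed) model parameters.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = 0.5 * np.eye(2)
P0 = np.eye(2)
path = sample_prior_path(A, B, P0, T=50, rng=rng)
```

Each row of `path` is one time step of a single prior sample; repeated calls with fresh random draws give independent sample paths.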

Next, in order to do posterior sampling of a dynamic process, from (11.3) it is the
statistics of the estimation error which we need to identify; for recursive
posterior sampling we therefore require a dynamic relationship for the estimation errors:

$$\tilde{z}(t+1) = \tilde{A}(t)\,\tilde{z}(t) + \tilde{q}(t), \qquad \tilde{z}(0) \sim \mathcal{N}(\tilde{z}_0, \tilde{P}_0), \quad \tilde{q}(t) \sim \mathcal{N}(0, \tilde{Q}(t)). \tag{11.7}$$

Although it is not obvious that such a form is obeyed, the error process of the Kalman
filter does in fact obey such a dynamic process. The statistics for the predicted
estimation error $\tilde{z}(t+1|t)$ were derived in (4.124) on page 108. Rewriting for the
updated estimation error $\tilde{z}(t|t)$ yields

$$\tilde{z}(t|t) = \hat{z}(t|t) - z(t) \tag{11.8}$$

$$= (I - K(t)C)\,A\,\tilde{z}(t-1|t-1) + K(t)\,v(t) - (I - K(t)C)\,B\,w(t-1); \tag{11.9}$$

consequently we can identify the dynamic parameters

$$\tilde{A}(t) = (I - K(t)C(t))\,A \tag{11.10}$$

$$\tilde{Q}(t) = K(t)R(t)K^T(t) + (I - K(t)C(t))\,BB^T\,(I - K(t)C(t))^T \tag{11.11}$$

with initialization

$$\tilde{z}_0 = 0 \tag{11.12}$$

$$\tilde{P}_0 = P_0. \tag{11.13}$$

Therefore for posterior dynamic sampling we must first run the Kalman filter to
compute the gain K(t) for each t, then draw a random sample

$$\tilde{z}(0) \sim \mathcal{N}(0, \tilde{P}_0) \tag{11.14}$$

to initialize, then recursively compute a sample error

$$\tilde{z}(t+1) = \tilde{A}(t)\,\tilde{z}(t) + \tilde{q}(t), \qquad \tilde{q}(t) \sim \mathcal{N}(0, \tilde{Q}(t)), \tag{11.15}$$

and finally the posterior sample is computed as the sampled error added to the
estimates:

$$(z|m)(t) = \hat{z}(t) + \tilde{z}(t). \tag{11.16}$$
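
The procedure above (run the filter for the gains, then simulate the error process) can be sketched as follows. This is an illustrative sketch under stated assumptions: A, B, C, R are time-invariant, the function names are hypothetical, and the indexing convention is chosen here so that the error step arriving at time t uses the gain K(t), consistent with (11.9). The estimates themselves depend on the measurements and are not computed in this sketch.

```python
import numpy as np

def filter_gains(A, B, C, R, P0, T):
    """Kalman covariance recursion: gains K(t) and updated covariances P(t|t)."""
    I = np.eye(A.shape[0])
    P_pred, gains, P_upd = P0, [], []
    for _ in range(T):
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
        P = (I - K @ C) @ P_pred
        gains.append(K)
        P_upd.append(P)
        P_pred = A @ P @ A.T + B @ B.T        # predicted covariance P(t+1|t)
    return gains, P_upd

def error_model(A, B, C, R, K):
    """Error dynamics of the form (11.10)-(11.11) for the step ending with gain K."""
    M = np.eye(A.shape[0]) - K @ C
    A_err = M @ A
    Q_err = K @ R @ K.T + M @ B @ B.T @ M.T
    return A_err, Q_err

def sample_error_path(A, B, C, R, P0, T, rng):
    """Draw one sample path of the updated estimation error."""
    gains, P_upd = filter_gains(A, B, C, R, P0, T)
    n = A.shape[0]
    err = np.zeros((T, n))
    err[0] = np.linalg.cholesky(P_upd[0]) @ rng.standard_normal(n)
    for t in range(1, T):
        A_err, Q_err = error_model(A, B, C, R, gains[t])
        # Tiny jitter keeps the Cholesky factor well defined if Q_err is near-singular.
        G = np.linalg.cholesky(Q_err + 1e-12 * np.eye(n))
        err[t] = A_err @ err[t - 1] + G @ rng.standard_normal(n)
    return err, P_upd

# Illustrative (assumed) model: two states, only the first one measured.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = 0.5 * np.eye(2)
C = np.array([[1.0, 0.0]])
R = np.array([[0.2]])
P0 = np.eye(2)
err, P_upd = sample_error_path(A, B, C, R, P0, 5, np.random.default_rng(0))
# A posterior sample would then be z_hat[t] + err[t], as in (11.16).
```

A useful sanity check is that propagating a covariance through the error model reproduces the filter's own updated covariances, since the error model is just the Joseph-form covariance update written as a dynamic process.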

The error process of the Kalman smoother, of Section 4.3.3, also obeys a dynamic
process [20]. The derivation is considerably more complicated; however, the recursive
form of the smoothing-error process $\tilde{z}_s$ is straightforward:

$$\tilde{z}_s(t) = \hat{z}(t|T) - z(t) \tag{11.17}$$

$$= A^{-1}(t)\left(I - BB^T P^{-1}(t+1|t)\right)\tilde{z}_s(t+1) - A^{-1}(t)\,\tilde{q}_s(t), \tag{11.18}$$

where $P(t+1|t)$ is the regular, predicted error covariance from the Kalman filter,
and where the noise process $\tilde{q}_s$ has covariance

$$\tilde{q}_s(t) \sim B\left(I - B^T P^{-1}(t+1|t)\,B\right)B^T. \tag{11.19}$$

Observe that the smoothing error recursion is going backwards, from t + 1 to t, just
like the RTS smoother, which explains the presence of inverse dynamics $A^{-1}$.



11.2 Static Sampling 

There are only a few ways in which a static estimation problem may be characterized:

• Given a covariance $P$ or inverse $P^{-1}$, whether prior or posterior;

• Given a set of constraints L in a regularized, non-Bayesian problem.

Let us investigate each of these two contexts in turn.

1. Given Covariance: Given a covariance matrix P, whether P describes a prior
or posterior

$$\text{Prior: } z \sim (\mu, P) \quad \text{or} \quad \text{Posterior: } z|m \sim (\hat{z}, P) \tag{11.20}$$

is immaterial, because in both cases the sampling process is identical:

$$\text{Let } x \sim P \text{ be a random sample} \;\Longrightarrow\; \begin{cases} \text{Prior Sampling:} & z = \mu + x \\ \text{Posterior Sampling:} & z|m = \hat{z} + x. \end{cases} \tag{11.21}$$

Next, whether we are given a covariance $P$ or its inverse $P^{-1}$, such as in a Markov
setting, is also immaterial, since in both cases we find a matrix square root via the
Cholesky decomposition (Appendix A.7.3):
[Figure 11.2 panels: "Estimate $\hat{z}$" (Storage: 1.3%, Computation: 6%); "Posterior Error Sample $\tilde{z}|m$" (Storage: 12%, Computation: 50%)]



Fig. 11.2. An estimate and posterior sample, generated from a sparse prior, using a Cholesky
decomposition. Storage and computational complexity are reported relative to a non-sparse,
matrix-inversion approach. The higher storage complexity associated with the posterior
sample is due to fill-in in the Cholesky step from $P^{-1}$ to L in (11.24).



$$P \;\xrightarrow{\text{chol}}\; P = \Gamma^T\Gamma \qquad z = \Gamma^T w$$

$$P^{-1} \;\xrightarrow{\text{chol}}\; P^{-1} = L^T L \qquad Lz = w \tag{11.22}$$

Although the latter case appears more difficult, because L is a triangular matrix
the latter Lz = w step is very simply solved by backsubstitution.



In the case where the covariance is the posterior covariance $\tilde{P}$ from an estimation problem,

$$\tilde{P} = \left(C^T R^{-1} C + P^{-1}\right)^{-1}, \tag{11.23}$$

it should be very clear that we do not want to explicitly calculate $\tilde{P}$ as part of
posterior sampling. Indeed, C and R are normally sparse or even diagonal, and
in any Markov setting $P^{-1}$ will be sparse-banded, therefore $\tilde{P}^{-1}$ will be sparse.
In (11.22), the Cholesky decomposition L will retain the sparsity of $\tilde{P}^{-1}$, making
the entire sequence

$$C, R, P^{-1} \;\longrightarrow\; \tilde{P}^{-1} \;\xrightarrow{\text{Chol.}}\; L \tag{11.24}$$

exceptionally computationally and storage efficient, as quantified in the example
in Figure 11.2.

2. Given Constraints: The constrained sampling problem is superficially
straightforward, since the constraints L assert

$$Lz = w \qquad w \sim I \tag{11.25}$$
Type of Problem             Constraint          Rank             Random Sample

Membrane, 1D                L = L_x             Full-Row         z = L^T (L L^T)^-1 w
Thin-Plate, 1D              L = L_xx            Full-Row         z = L^T (L L^T)^-1 w
Membrane, 2D                L = [L_x; L_y]      Rank-Deficient   z = L^+ w
Thin-Plate, 2D              L = [L_xx; L_yy]    Rank-Deficient   z = L^+ w
nD Membrane + Weak Mean     L = [...; αI]       Full-Column      z = (L^T L)^-1 L^T w
nD Thin-Plate + Weak Mean   L = [...]           Full-Column      z = (L^T L)^-1 L^T w











Table 11.1. Six examples of constraints, rank properties, and associated method of sam- 
pling. Whereas Bayesian approaches will always be positive-definite, and can be solved by 
a Cholesky decomposition, many constrained problems, such as here, will be rank-deficient 
and necessitate a somewhat more complex pseudoinverse. 



meaning that the matrix square root in (11.22) is already available in L, and all
that remains is to choose a random w and to solve the linear system in (11.25).

However, whereas a given covariance $P$ or $P^{-1}$ is guaranteed to be positive-definite,
simplifying the preceding discussion, throughout Chapter 5 we saw that
the constraints matrix L is normally rectangular and may be rank-deficient.

As discussed in Appendix A.9, the matrix pseudoinverse $L^+$ finds a solution to any
linear system Lz = w, returning the unique answer if L is invertible, the
least-squares answer if the problem is overconstrained, or the least-norm solution
for z if the problem is underconstrained. A number of possible constraints and
associated solutions are illustrated in Table 11.1, and two samples are plotted in
Figure 11.3.

We need to be clear that although $z = L^+ w$ generates a random sample, consistent
with the asserted constraints Lz = w, because the regularized problem is non-Bayesian
and has no prior, the generated samples, such as those in Figure 11.3,
cannot really be considered prior samples.

Finally, it should also be pointed out that in cases of full rank the pseudoinverse
is easily calculated as $(L^T L)^{-1} L^T$ or $L^T (L L^T)^{-1}$, for full-column and full-row
rank, respectively. As before, since $LL^T$ and $L^T L$ are symmetric, positive-definite,
the Cholesky decomposition should be used in calculating the matrix inverse.
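
The static cases discussed above (a given covariance, a given inverse covariance, and a given set of constraints) can be sketched with dense NumPy routines. This is an illustration under assumptions: the function names, the small tridiagonal inverse covariance and the first-difference constraint matrix are invented for the example, and a practical implementation for large problems would use sparse Cholesky factors and a triangular solver rather than the dense calls used here.

```python
import numpy as np

def sample_from_covariance(P, rng):
    """z = Gamma^T w with P = Gamma^T Gamma (Gamma^T is NumPy's lower factor)."""
    G_lower = np.linalg.cholesky(P)            # P = G_lower @ G_lower.T
    return G_lower @ rng.standard_normal(P.shape[0])

def sample_from_precision(J, rng):
    """Solve L z = w with J = P^{-1} = L^T L, where L is upper triangular."""
    L = np.linalg.cholesky(J).T                # J = L.T @ L
    w = rng.standard_normal(J.shape[0])
    # A dedicated triangular solver (back-substitution) would exploit the
    # structure of L; np.linalg.solve is used only to stay NumPy-only.
    return np.linalg.solve(L, w)

def sample_from_constraints(L, rng):
    """z = L^+ w for a possibly rectangular, rank-deficient constraint matrix."""
    w = rng.standard_normal(L.shape[0])
    return np.linalg.pinv(L) @ w

rng = np.random.default_rng(0)
n = 6
# Assumed toy prior: tridiagonal (first-order Markov) inverse covariance.
J = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
z_prec = sample_from_precision(J, rng)
z_cov = sample_from_covariance(np.linalg.inv(J), rng)
# 1D membrane (first-difference) constraints: full row rank, (n-1) x n.
Lx = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
z_con = sample_from_constraints(Lx, rng)
```

For the full-row-rank first-difference constraints, `np.linalg.pinv(Lx)` coincides with the closed form $L^T (L L^T)^{-1}$ quoted above.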
[Figure 11.3 panels: "2D Membrane, No Boundary"; "2D Thin-Plate, No Boundary, Centre Cut"]



Fig. 11.3. Two random samples generated from rank-deficient constraints L by finding the 
pseudoinverse L + . Although neither image can be considered a true prior sample, since the 
problems are non-Bayesian, nevertheless the samples clearly illustrate the behaviour of their 
respective constraints, with the cut and the nonperiodic boundaries particularly clear in the 
second-order example, right. 



For larger problems, a nested-dissection reordering (Section 9.1.3) can be applied to 
improve the computational and storage efficiency of the Cholesky steps. However, to 
tackle sampling for very large problems we need some sort of problem transforma- 
tion or domain decomposition, discussed in the following sections. 



11.2.1 FFT 



The FFT method of sampling was already discussed in Section 8.3, and a large frac- 
tion of the random-field images plotted in this text were generated using the FFT 
approach. 

For fully stationary problems with periodic boundary conditions, the FFT
diagonalizes the problem, allowing problems of nearly arbitrary size to be tackled. The prior
sample

$$Z = \text{FFT}_d^{-1}\left( \sqrt{\text{FFT}_d(\mathcal{P})} \odot \text{FFT}_d(W) \right) \tag{11.26}$$

and posterior sample

$$(Z|M) = \hat{Z} + \text{FFT}_d^{-1}\left( \sqrt{\text{FFT}_d(\tilde{\mathcal{P}})} \odot \text{FFT}_d(W) \right) \tag{11.27}$$

were developed in (8.78), (8.82).



Even if a problem is not fully stationary, the FFT approach may still have merit in 
generating samples of stationary portions of a problem, or in generating stationary 
samples ignoring the nonstationary aspects of a boundary, for example. 
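
A minimal sketch of FFT-based prior sampling for a stationary, periodic two-dimensional field follows. It assumes the standard construction in which white noise is shaped by the element-wise square root of the kernel spectrum; the function name `fft_prior_sample` and the Gaussian-shaped example kernel are illustrative assumptions, not the kernels used in the text.

```python
import numpy as np

def fft_prior_sample(kernel, rng):
    """Stationary, periodic sample: Z = IFFT( sqrt(FFT(P)) * FFT(W) ).

    kernel : covariance kernel P with its origin at index [0, 0],
             the same size as the field to be sampled.
    """
    W = rng.standard_normal(kernel.shape)
    spectrum = np.fft.fft2(kernel).real
    # A valid covariance kernel has a nonnegative spectrum; clip round-off.
    spectrum = np.clip(spectrum, 0.0, None)
    Z = np.fft.ifft2(np.sqrt(spectrum) * np.fft.fft2(W))
    return Z.real, W

rng = np.random.default_rng(0)
n = 64
# Assumed example kernel: periodic, Gaussian-shaped correlation.
d = np.minimum(np.arange(n), n - np.arange(n))
kernel = np.exp(-(d[:, None] ** 2 + d[None, :] ** 2) / (2 * 4.0 ** 2))
Z, W = fft_prior_sample(kernel, rng)
```

With a unit impulse as the kernel (a white field) the spectrum is flat and the sample reduces to the white noise W itself, which is a convenient check of the scaling.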
Fig. 11.4. Given one or more continuous-state FFT samples, discretizing them gives an ad
hoc, but very fast, approach to discrete-state sampling. By combining multiple fields, right,
samples with relatively complex, multiscale morphology can be synthesized. Here, $Z_f$ and $Z_c$
are continuous-state thin-plate samples at fine and coarse scales, respectively.



Although most discrete-state sampling methods are based on the MCMC methods in
Section 11.3, an FFT sample can be discretized [208] as a very fast, albeit heuristic,
approach. Three examples are shown in Figure 11.4; clearly quite creative things can
be done by combining multiple random fields, and the efficiency of the FFT allows
this approach to be generalized to three-dimensional domains.



11.2.2 Marching 



In Section 10.1 we took advantage of the static-dynamic duality, originally discussed 
in Section 4.1.2, to allow a static problem to be partitioned and estimated recursively 
using a Kalman filter. Clearly precisely the same concept can be extended from esti- 
mation to sampling. 

That is, given the statistics of the static problem and a proposed partitioning, we can 
identify the cross-statistics to form a dynamic model, as described in Section 10.1. 

For prior sampling, we apply the identified dynamic model to the dynamic sampler 
of Section 11.1. In the case of posterior sampling we run the Kalman filter, possibly 
large (as in Section 10.2), build the dynamic model (11.7) for the estimation errors, 
and compute the posterior samples from (11.15). 

The difficulty in sampling from a marching approach is that the marching approach 
is causal, whereas most static problems are noncausal, as was illustrated in Exam- 
ple 10.1. Consequently we would normally prefer smoothing, as discussed in Sec- 
tion 10.2.1. 
The simple, ad hoc approach to smoothing illustrated in Example 10.1, where causal
and anticausal estimates are combined, cannot be applied to random prior or posterior
sampling, however. The reason is that multiple suboptimal estimates $\hat{z}_i$ are all
attempting to estimate the same quantity z, therefore the $\{\hat{z}_i\}$ are correlated and can
be averaged:

$$\hat{z}_i \sim (z, \tilde{P}_i) \;\Longrightarrow\; E_i[\hat{z}_i] \approx z. \tag{11.28}$$

On the other hand, random samples $\tilde{z}_i$ are driven by independent noise processes,
and do not average meaningfully:

$$\tilde{z}_i \sim (0, \tilde{P}_i) \;\Longrightarrow\; E_i[\tilde{z}_i] \approx 0. \tag{11.29}$$

It is for precisely the same reason that the overlapped approach of Section 8.2.3,
which performs pointwise averaging similar to (11.28) and (11.29), is much more
effective for estimation than for sampling.
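
A short worked calculation makes the point concrete. Assuming, purely for illustration, two independent zero-mean samples $\tilde{z}_1$ and $\tilde{z}_2$ sharing a common covariance $\tilde{P}$, their pointwise average has covariance

```latex
\operatorname{cov}\!\left( \tfrac{1}{2}\left(\tilde{z}_1 + \tilde{z}_2\right) \right)
  = \tfrac{1}{4}\left( \tilde{P} + \tilde{P} \right)
  = \tfrac{1}{2}\,\tilde{P} \;\neq\; \tilde{P} ,
```

so the average has lost half of its variance and is no longer a sample with the intended statistics. Averaging estimates, by contrast, leaves the quantity being estimated unchanged.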

In contrast, if a sparse Kalman filter is implemented such that it is possible to save 
the covariance information in order to implement the regular Kalman smoother, then 
the recursive form of the smoothing-error process (11.18) can be used to generate 
posterior samples. 



11.2.3 Multiscale Sampling 

The multiscale method of Section 10.3 obeys a scale-to-scale dynamic model (10.54),

$$x(s) = A(s)\, x(\wp s) + B(s)\, w(s), \tag{11.30}$$

in which $\wp s$ denotes the parent of node s, so prior sampling in the multiscale
environment follows trivially from Section 11.1.
The linearity of the multiscale model allows the prior mean to be separated from the
remainder of the problem, so we can, without loss of generality, assume zero mean.

We begin at the root node

$$P(0) \;\xrightarrow{\text{chol}}\; P(0) = \Gamma^T\Gamma \;\Longrightarrow\; x(0) = \Gamma^T q, \quad q \sim I. \tag{11.31}$$

After this one Cholesky decomposition, the remaining nodes follow per the dynamics:

$$x(s) = A(s)\, x(\wp s) + B(s)\, w(s), \qquad w(s) \sim I. \tag{11.32}$$
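
The coarse-to-fine prior recursion can be sketched on a toy tree. This is a deliberately simplified illustration: the states are scalars, the parameters a and b are the same at every node, and the tree layout (node k at one scale has parent k // children at the next coarser scale) is an assumption made for the example; the multiscale models of Section 10.3 use vector states and node-dependent matrices.

```python
import numpy as np

def sample_tree(levels, a, b, p0, rng, children=4):
    """Coarse-to-fine sampling x(s) = a * x(parent) + b * w(s), scalar states.

    Returns a list of arrays, one per scale, coarsest (root) first.
    """
    root = np.sqrt(p0) * rng.standard_normal(1)   # scalar square root of P(0)
    scales = [root]
    for _ in range(levels):
        parent = np.repeat(scales[-1], children)  # copy each node to its children
        w = rng.standard_normal(parent.size)      # white, unit variance
        scales.append(a * parent + b * w)
    return scales

rng = np.random.default_rng(0)
scales = sample_tree(levels=3, a=0.9, b=0.4, p0=1.0, rng=rng)
```

The finest scale, `scales[-1]`, holds the sample of the field itself; the coarser scales are the intermediate states that decorrelate their descendants.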

Next, since the multiscale estimator was derived from the RTS smoother, and since a
smoothing-error process (11.18) has been derived [20] for the RTS smoother, it turns
out that the error in the estimates of the multiscale method itself obeys a
multiscale model [215]:

$$\tilde{x}(s) = \tilde{A}(s)\,\tilde{x}(\wp s) + \tilde{w}(s), \qquad \tilde{w}(s) \sim \mathcal{N}(0, \tilde{Q}(s)), \quad \tilde{x}(0) \sim \mathcal{N}(0, \tilde{P}(0)), \tag{11.33}$$
Fig. 11.5. A random sampling approach to estimating the value of π. In the left panel, the area
of the square is 4, and the area of the dashed circle is π · 1². Therefore, placing
uniformly-distributed random samples and multiplying the fraction inside the circle by four gives us an
estimate of the value of π, right.



where $\tilde{A}(s)$, $\tilde{Q}(s)$ are expressed [215] in terms of the model parameters A(s), B(s),
prior covariances P(s), and estimation error covariances $\tilde{P}_u(s)$ computed in the
upwards pass of the multiscale estimator:

$$\tilde{A}(s) = \tilde{P}_u(s)\,F^T(s)\,P^{-1}(\wp s) \tag{11.34}$$

$$\tilde{Q}(s) = \tilde{P}_u(s) - \tilde{P}_u(s)\,F^T(s)\,P^{-1}(\wp s)\,F(s)\,\tilde{P}_u(s) \tag{11.35}$$

$$F(s) = P(\wp s)\,A^T(s)\,P^{-1}(s). \tag{11.36}$$

The availability of a multiscale smoothing error model leads to an approach to
hierarchical posterior sampling, following the dynamic approach of Section 11.1:

1. The multiscale estimator computes $\hat{x}(s)$, $\tilde{P}(s)$ at each node on the tree.

2. Infer the smoothing error model $\tilde{A}(s)$, $\tilde{Q}(s)$ from (11.34) to (11.36).

3. Using the Cholesky decomposition compute the matrix square roots

$$\tilde{\Gamma}(0) = \left(\tilde{P}(0)\right)^{1/2} \qquad \tilde{\Gamma}(s) = \left(\tilde{Q}(s)\right)^{1/2}, \; s \neq 0. \tag{11.37}$$

4. Sample the smoothing error process, coarse to fine:

$$(\tilde{x}|m)(0) = \tilde{\Gamma}^T(0)\, w(0), \qquad w(0) \sim \mathcal{N}(0, I) \tag{11.38}$$

$$(\tilde{x}|m)(s) = \tilde{A}(s)\,(\tilde{x}|m)(\wp s) + \tilde{\Gamma}^T(s)\, w(s), \qquad w(s) \sim \mathcal{N}(0, I). \tag{11.39}$$

5. Finally, add the mean (the estimates) to produce the posterior sample:

$$(x|m)(s) = \hat{x}(s) + (\tilde{x}|m)(s). \tag{11.40}$$
Fig. 11.6. Regular gradient descent, left, will converge to and stay in a local minimum.
Stochastic gradient descent, right, attempts to escape local minima by accepting inferior val- 
ues of x with some probability, essentially perturbing or agitating the current position in order 
to overcome barriers of limited height. 



If the state x(s) has dimension n(s), then each of the estimation, smoothing, and
square root steps has complexity $O(n(s)^3)$ per node.

The posterior sampling images in Figure 11.1 were calculated via posterior sampling 
on a multiscale model. 



11.3 MCMC 



The great increase in computer processing power over the last twenty years has cre- 
ated opportunities for a class of algorithms, exceptionally simple in concept, but with 
very large computational complexity. The methods, known collectively as Markov 
Chain Monte Carlo (MCMC) methods [42, 64, 88, 137, 155, 194, 335], are based on 
using randomness to solve mathematical goals. 

A simple, classic example is shown in Figure 11.5, whereby random samples can be
used to estimate the value of π. Although the same approach could be used with a
non-random dense, regular grid, such a grid requires the number of sample points to 
be decided upon ahead of time, and a valid answer is reached only when the grid is 
complete. 
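
The random-sampling estimate of π can be sketched in a few lines of Python. This is a minimal illustration which assumes the usual construction (points drawn uniformly in the unit square, counting those inside the quarter circle); the exact geometry of Figure 11.5 may differ, and the function name `estimate_pi` is hypothetical.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of uniform points falling inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4 and the unit square has area 1.
    return 4.0 * inside / num_samples

# The estimate is valid at any point and simply improves as samples are added.
for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```

Unlike a regular grid, the estimate can be stopped or extended at any time, at the price of an error that shrinks only as the inverse square root of the number of samples.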

The MCMC approach is impractical for the specific problem of estimating π, since
there exist much faster specialized methods. Nevertheless the random sampling
approach is an exceptionally elegant and simple idea, capable of sampling from any

Algorithm 10 Single Gibbs Sample

Goals: Perform Gibbs sampling at site $i$ in $z$ based on $H()$
Function $\hat{z}$ = SingleGibbs($H()$, $T$, $z$, $i$, $\Psi$)
    $\hat{z} \leftarrow z$
    for $j \in \Psi$ do
        $h(j) \leftarrow \exp(-H(z, z_i = j)/T)$        Test all possible state values.
    end for
    $s \leftarrow \mathrm{sum}(h)$        Compute marginal partition function
    $r \leftarrow$ UniformRandom()
    $p \leftarrow 0$
    for $j \in \Psi$ do
        $p \leftarrow p + h(j)/s$
        if $r < p$ then
            $\hat{z}_i \leftarrow j$        Sample state j from marginal distribution.
            Return
        end if
    end for




distribution, with further generalizations to sampling state spaces of variable dimen- 
sions [147]. 

Consider stochastic gradient descent, as illustrated in Figure 11.6. Regular gradient
descent seeks to minimize an objective function $f(x)$ by moving against the gradient,

$x_{k+1} = x_k - \alpha f'(x_k),$ (11.41)

but may converge to, and stay in, a local minimum, where $f'(x) = 0$. Stochastic
gradient descent adds a degree of random variability, such that the current point has
a nonzero probability of accepting an inferior solution by moving "uphill," increasing
the likelihood of escaping local minima. If we let $T_k$ be the "temperature" parameter
(the degree of agitation) at iteration $k$, then $T_k$ needs to decrease to zero to allow the
system to finally converge:

Rapid decrease in $T_k$ → Fast convergence, more likely stuck in local minimum
Slow decrease in $T_k$ → Slower convergence, more likely at global minimum. (11.42)

This concept and the resulting tradeoff are the essence of annealing, discussed in the
following section.



11.3.1 Stochastic Sampling 

Rather than random perturbations of a ball on a hill, we are interested in considering
random state transitions $Z \Rightarrow \bar{Z}$ with transition probability $P(Z \Rightarrow \bar{Z})$. Indeed,
consider a sequence of perturbations

$Z_1 \rightarrow Z_2 \rightarrow \cdots$ (11.43)

Algorithm 11 Single Metropolis Sample

Goals: Consider a state change $z \rightarrow \bar{z}$ on the basis of the probability density implied by $H()$
Function $\hat{z}$ = SingleMetropolis($H()$, $T$, $z$, $\bar{z}$)
    $h \leftarrow H(z)/T$
    $\bar{h} \leftarrow H(\bar{z})/T$
    $r \leftarrow$ UniformRandom()
    if $r < \exp(h - \bar{h})$ then
        $\hat{z} \leftarrow \bar{z}$
    else
        $\hat{z} \leftarrow z$
    end if



Remarkably, this sequence will converge [144, 335] to a random sample of distribution
$p(Z)$ if two conditions hold:

Irreducibility: Every state must be reachable,

$P^{(k)}(Z \Rightarrow \bar{Z}) > 0 \quad \forall Z, \bar{Z}$ (11.44)

for some $k > 0$, where $P^{(k)}$ is the $k$-step transition probability.

Detailed Balance: There is a balanced equilibrium,

$p(Z)\,P(Z \Rightarrow \bar{Z}) = p(\bar{Z})\,P(\bar{Z} \Rightarrow Z).$ (11.45)

The two most common choices of state perturbation or transition are the Gibbs [335]
and Metropolis samplers:

Gibbs: (Algorithm 10)

Randomly sample a single state element $Z_i$ from its marginal distribution:

$P(Z_i \Rightarrow a \mid Z) = \frac{p(Z, z_i = a)}{\sum_{b} p(Z, z_i = b)}$ (11.46)

Metropolis: (Algorithm 11)

The transition $Z \Rightarrow \bar{Z}$ depends on the relative likelihoods:

$p(\bar{Z}) \ge p(Z)$: Accept transition with probability 1.
$p(\bar{Z}) < p(Z)$: Accept transition with probability $p(\bar{Z})/p(Z)$. (11.47)

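That the Metropolis transition satisfies detailed balance (11.45) is readily verified.
Supposing, without loss of generality, that $p(Z) > p(\bar{Z})$, the transition $Z \Rightarrow \bar{Z}$ is
accepted with probability $p(\bar{Z})/p(Z)$ and the reverse transition with probability 1, so that

$p(Z)\,P(Z \Rightarrow \bar{Z}) = p(Z)\,\frac{p(\bar{Z})}{p(Z)} = p(\bar{Z}) \cdot 1 = p(\bar{Z})\,P(\bar{Z} \Rightarrow Z).$ (11.48)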
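
The two samplers of Algorithms 10 and 11 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the state is assumed to be a flat list, the energy function is passed in as a callable, and the example energy `chain_energy` is hypothetical. Energies are assumed to be moderate relative to T, so that the exponentials in the Gibbs sampler do not overflow.

```python
import math
import random

def single_gibbs(H, T, z, i, states, rng):
    """Resample element i of the state list z, in the manner of Algorithm 10."""
    z_new = list(z)
    weights = []
    for j in states:
        z_new[i] = j
        weights.append(math.exp(-H(z_new) / T))  # unnormalized probability of candidate j
    s = sum(weights)                             # local normalization only
    r = rng.random()
    p = 0.0
    for j, w in zip(states, weights):
        p += w / s
        if r < p:
            z_new[i] = j
            return z_new
    z_new[i] = states[-1]                        # guard against floating-point round-off
    return z_new

def single_metropolis(H, T, z, z_prop, rng):
    """Accept or reject the proposed state z_prop, in the manner of Algorithm 11."""
    dh = (H(z_prop) - H(z)) / T
    if dh <= 0 or rng.random() < math.exp(-dh):
        return list(z_prop)
    return list(z)

def chain_energy(z):
    """Hypothetical example energy: -1 for every pair of equal neighbours in a 1-D chain."""
    return -sum(1.0 for a, b in zip(z, z[1:]) if a == b)

rng = random.Random(0)
print(single_metropolis(chain_energy, 1.0, [0, 1, 0], [0, 0, 0], rng))
print(single_gibbs(chain_energy, 1.0, [1, 0, 1], 1, [0, 1], rng))
```

Note that the Gibbs sampler here evaluates the full energy for every candidate value; in practice only the terms involving site i are needed.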

Fig. 11.7. As temperature $T$ is decreased, the probability density $\exp(-H/T)/\mathcal{Z}(T)$ is
weighted more and more heavily towards peaks in the distribution.



Recalling the definition of a Gibbs distribution¹ from (6.39),

$p(z) = \frac{1}{\mathcal{Z}}\, e^{-H(z)/T},$ (11.49)

what is significant is that neither sampler requires the calculation of the difficult
partition function $\mathcal{Z}$, since the Gibbs sampler needs only marginal distributions, and
Metropolis needs only ratios of distributions in which $\mathcal{Z}$ cancels:

$\frac{p(\bar{Z})}{p(Z)} = \frac{\frac{1}{\mathcal{Z}}\, e^{-H(\bar{Z})/T}}{\frac{1}{\mathcal{Z}}\, e^{-H(Z)/T}} = \exp\left(\frac{H(Z) - H(\bar{Z})}{T}\right).$ (11.50)



Furthermore, recalling from (6.42) that $H$ is a sum of potential functions over
cliques,

$H(Z) = \sum_{c \in C} V_c(\{z_i, i \in c\}),$ (11.51)

and letting

$\bar{C} = \{c \mid Z_c \neq \bar{Z}_c\}$ (11.52)

be the clique set over which $Z$, $\bar{Z}$ are not identical, then the probability ratio from
(11.50) becomes

$\frac{p(\bar{Z})}{p(Z)} = \exp\left(\frac{\sum_{c \in \bar{C}} \left[ V_c(Z_c) - V_c(\bar{Z}_c) \right]}{T}\right),$ (11.53)

which is a straightforward sum over a modest number of potential functions, with no
partition function or probability density!
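
To make the locality of this ratio concrete, the following Python sketch computes the energy change for a single-site flip using only the cliques containing that site. It assumes a first-order Ising-type energy with states of plus or minus one on a non-periodic grid; that choice of energy, and all function names, are assumptions made for illustration.

```python
import math

def neighbours(i, j, rows, cols):
    """First-order (4-connected) neighbours of site (i, j) on a non-periodic grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < rows and 0 <= b < cols:
            yield a, b

def total_energy(z, beta):
    """H(Z) = -beta * sum of z_a * z_b over neighbouring pairs, each pair counted once."""
    rows, cols = len(z), len(z[0])
    h = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                h -= beta * z[i][j] * z[i + 1][j]
            if j + 1 < cols:
                h -= beta * z[i][j] * z[i][j + 1]
    return h

def flip_energy_change(z, i, j, beta):
    """Energy of the flipped state minus the current energy, from local cliques only."""
    s = sum(z[a][b] for a, b in neighbours(i, j, len(z), len(z[0])))
    return 2.0 * beta * z[i][j] * s

def acceptance_ratio(z, i, j, beta, T):
    """Probability ratio of flipped to current state, as in (11.53)."""
    return math.exp(-flip_energy_change(z, i, j, beta) / T)
```

The local computation touches at most four neighbours, whereas `total_energy` visits the whole grid; the two agree on the energy difference.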



¹ Strictly speaking, in (6.39) Gibbs parameter $\beta = 1/kT$, where $k$ is Boltzmann's constant.
Since we are not working with actual physics-based Hamiltonians $H$, constant $k$ is
unimportant and we drop it for clarity.

Algorithm 12 Basic, Single-Scale Annealing

Goals: Solve the optimization problem $\hat{z} = \arg_z \min H(z)$ with annealing schedule $T_k$
Function $\hat{z}$ = Anneal($H()$, $\{T_k\}$, $z_0$)
    $k \leftarrow 0$
    while not converged do
        $z_{k+1} \leftarrow$ MetropolisSampler($z_k$, $H()$, $T_k$)
        $k \leftarrow k + 1$
    end while



Next, following up on the illustration in Figure 11.6, what role does "temperature" $T$
play in (11.53)? For a fixed value of $T$, we are performing random sampling, either
posterior or prior, depending on whether $H$ includes measurement terms. If $T$ is
decreased with iteration, we are slowly "cooling" the random perturbations, a process
known as simulated annealing [126, 127, 335], shown in Algorithm 12. As $T \rightarrow 0$
we are sampling from the global minimum (or possibly multiple global minima) of
$H$, as illustrated in Figure 11.7:

$\lim_{T \to 0} p(Z) = \begin{cases} 1/|Z_{\min}| & Z \in Z_{\min} \\ 0 & \text{otherwise,} \end{cases}$ (11.54)

where

$Z_{\min} = \{\arg_Z \min H(Z)\} = \{\arg_Z \max p(Z)\}.$ (11.55)

If $H$ contains measurement terms, then $p()$ is a posterior probability, in which case
the annealed sample is taken from the set of points maximizing the posterior
distribution, meaning that we have found a MAP estimate:

$\hat{Z} = \hat{Z}_{\mathrm{MAP}}(M).$ (11.56)



As already discussed in (11.42), whether the sampler actually converges to this global
optimum is a function of the temperature schedule $T_k$. Theoretically [127], to
converge globally requires an exceptionally slow logarithmic schedule $T_k \propto 1/\log(k)$.
In practice, most approaches choose to cool much more quickly, known as quenching,
commonly using an exponential schedule $T_k \propto \exp(-kr)$ for some cooling rate
$r$.
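
The annealing loop of Algorithm 12 and the two schedules can be sketched as below. This is a minimal Python illustration under stated assumptions: the fixed iteration count stands in for a convergence test, the offset in the logarithmic schedule avoids dividing by log(1) = 0, and the one-dimensional energy landscape and proposal function are hypothetical.

```python
import math
import random

def exponential_schedule(T0, rate, k):
    """Quenching schedule: T_k = T0 * exp(-rate * k)."""
    return T0 * math.exp(-rate * k)

def logarithmic_schedule(T0, k):
    """Slow schedule: T_k = T0 / log(k + 2)."""
    return T0 / math.log(k + 2)

def anneal(H, propose, z0, schedule, num_iter, rng):
    """Metropolis sampling with a temperature that changes at every iteration."""
    z = z0
    for k in range(num_iter):
        T = schedule(k)
        z_prop = propose(z, rng)
        dh = H(z_prop) - H(z)
        if dh <= 0 or rng.random() < math.exp(-dh / T):
            z = z_prop
    return z

# Hypothetical landscape: local minimum at index 2, barrier at 5, global minimum at 9.
energies = [3, 2, 1, 2, 3, 4, 3, 2, 1, 0, 1, 2, 3]

def H(x):
    return energies[x]

def propose(x, rng):
    """Step one index left or right, clamped to the valid range."""
    return min(max(x + rng.choice((-1, 1)), 0), len(energies) - 1)

result = anneal(H, propose, 2, lambda k: exponential_schedule(5.0, 0.01, k),
                2000, random.Random(0))
print(result, H(result))
```

Increasing the cooling rate makes the run settle sooner, but makes it more likely to remain in the local minimum in which it started.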



11.3.2 Continuous-State Sampling 

The previous section was deliberately ambiguous as to whether the state $Z$ was a
discrete-state or continuous-state process. Although uncommon, it is possible to sample
and anneal continuous-state fields.

[Figure 11.8 panels: two different first-order neighbourhoods (left); a checkerboard pattern of non-interacting state sets (right).]

Fig. 11.8. A first-order model, left, on a two-dimensional regular grid consists of two
non-interacting sets (light and dark) of state elements, arranged as a checkerboard pattern.
Because of the non-interaction, all of the elements in each set can be sampled in parallel.



Because the Gibbs sampler (Algorithm 10) samples from the marginal distribution, in
the continuous-state case this would require the difficult numerical characterization
of a continuous-state PDF [176]. In contrast, the Metropolis sampler (Algorithm 11)
requires only the evaluation of the relative energies at the current $H(Z)$ and proposed
$H(\bar{Z})$ states. There remains, however, the challenge of generating meaningful proposals,
since in the infinity of possible continuous-state values, only an infinitesimal fraction
represent meaningful alternatives.

As we have seen throughout this text, there are many methods for solving
regularly-gridded continuous-state problems, and sampling / annealing methods have little to
offer. However, there are unstructured parametric problems, known as marked point
processes [246, 292], where continuous-state annealing has provided promising
results. In these problems, a relatively small number of continuous values parametrize
the behaviour of an image, such as the locations of roads [292] or the orientations
and sizes of buildings [245].



11.3.3 Large-Scale Discrete-State Sampling 



The primary application of MCMC methods in the context of this text is to
large-scale discrete-state fields, found particularly in the hidden fields of Chapter 7, even
more specifically as the estimator in the E-Step of the EM algorithm used with hidden
fields, as described in Section 7.5.

In the somewhat rare event that our discrete-state field is acyclic (having no loops;
see Figure 5.1 on page 135), whether sequential or tree-based, then there are very
efficient approaches for producing state estimates based on the Viterbi algorithm of
Section 4.5.1: the forward-backward algorithm [265, 275] for sequential problems,
and a very similar upward-downward algorithm on trees [77, 275].

Fig. 11.9. The energy map for the simple, two-state problem of (11.58) is shown. The only
likely states are $z^T = [0\;0]$ and $z^T = [1\;1]$. If the current state is $z^T = [0\;0]$, then it is not
possible to reach [1 1] in a single-state flip, however the intermediate states [0 1], [1 0] are
very improbable, so there is a significant barrier in place to the [0 0] → [1 1] transition unless
both states change simultaneously.



It is much more common, however, for the discrete states to live on a multidimen- 
sional regular grid. As with the marching methods of Chapter 10, we can choose 
to impose a sequential ordering of the grid elements (using a Peano or space-filling 
curve), in which case the efficient forward-backward algorithm applies, however 
such an ordering brings the same modelling drawbacks as we saw with the causal 
random fields and symmetric half-plane models of Chapter 6. 

In general, given a multidimensional discrete-state problem the most immediate
acceleration is vectorization, processing non-interacting state elements simultaneously.
Specifically, given a set $S$ of non-interacting sites

$S = \{s_1, s_2, \ldots\}$ such that $s_i \notin \mathcal{N}_{s_j} \;\; \forall i, j$ (11.57)

then, if the neighbourhood $\mathcal{N}$ is relatively small in extent, $|S|$ will be fairly large.
Best known is the checkerboard vectorization, for first-order models such as Ising,
whereby the entire grid can be divided into only two sets $S_{\mathrm{black}}$ and $S_{\mathrm{white}}$, as
illustrated in Figure 11.8.
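
The checkerboard partition is easy to construct and check. The Python sketch below assumes a first-order (four-neighbour) neighbourhood on a rectangular grid; the function names are hypothetical.

```python
def checkerboard_sets(rows, cols):
    """Split the grid sites into two sets according to the parity of i + j."""
    black = [(i, j) for i in range(rows) for j in range(cols) if (i + j) % 2 == 0]
    white = [(i, j) for i in range(rows) for j in range(cols) if (i + j) % 2 == 1]
    return black, white

def first_order_neighbours(site):
    """The four nearest neighbours of a site (sites off the grid are harmless here)."""
    i, j = site
    return {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}

def is_non_interacting(sites):
    """True if no site in the set is a neighbour of another site in the set."""
    s = set(sites)
    return all(first_order_neighbours(a).isdisjoint(s) for a in s)

black, white = checkerboard_sets(4, 5)
print(is_non_interacting(black), is_non_interacting(white))
```

Because neither set contains a pair of neighbours, every site within one set can be Gibbs sampled simultaneously, conditioned on the other set.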

Methods of acceleration beyond vectorization must somehow address the problem 
of state coupling, the problem of indirection, and the random walk phenomenon of 
Figure 8.18 in Section 8.6, whereby the number of iterations for convergence is a 
quadratic function of structure size or correlation length. 

A related problem is the concept of an energy barrier. Suppose we have a simple,
two-state Ising model for which the energy map is plotted in Figure 11.9:

$H(z) = -\beta\, \delta_{z_1, z_2} - \alpha z_1, \qquad z_1, z_2 \in \{0, 1\}, \qquad \beta > \alpha > 0.$ (11.58)

Example 11.1: Annealing and Energy Barriers

Suppose we have a large two-dimensional field consisting of small and large discs:

[Image: a field of small and large discs.]

An energy function corresponding to such a field will wish to encourage small
and large discs, but to penalize discs of other radii,

[Plot: energy as a function of disc size.]

creating an energy barrier. If an annealer were initialized with a random set of
pixel values, it would quickly reduce the energy (increase the probability) by
clumping the random values into small discs, however the energy function will
make it very difficult to anneal small discs gradually into larger ones, because of
the strong penalty placed on intermediate sizes.

We can try to avoid this energy barrier by formulating the problem at coarser
scales, where incremental changes correspond to the simultaneous changing of
many state elements at the finest scale:

[Image panels: Down 2 Scales; Down 4 Scales; Down 5 Scales.]

At the coarsest scale, only two to four "black" elements are required in order to
produce a large, finest-scale "black" disc of approximately 7000 pixels.

Algorithm 13 Multilevel Annealing, Initializing from Coarser Scale

Goals: Solve $\hat{z} = \arg_z \min H(z)$ by annealing, each scale initialized by the preceding
Function $\hat{z}$ = HierAnneal(Scales, $\{H^j()\}$, $\{T^j_k\}$)
    $\hat{z}^{\mathrm{Scales}} \leftarrow$ Anneal($H^{\mathrm{Scales}}$, $\{T^{\mathrm{Scales}}_k\}$)        Regularly anneal coarsest scale
    for $j \leftarrow$ Scales $- 1$ to 0 do
        $z^j_0 \leftarrow$ Project($\hat{z}^{j+1}$)        Project to finer scale
        $\hat{z}^j \leftarrow$ Anneal($H^j(\cdot)$, $\{T^j_k\}$, $z^j_0$)        Anneal at scale j
    end for



Starting in state [0 0], to get to the optimum state [1 1] with single-site state changes
we need to pass through highly unlikely intermediate states [0 1] or [1 0], an energy
barrier. Some sort of simultaneous or multiple-state flipping is required.
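
The barrier can be made concrete with a few lines of Python. The sketch assumes one reading of (11.58), namely an energy of $-\beta$ when the two states agree plus a term $-\alpha z_1$; both that form and the values $\alpha = 1$, $\beta = 5$, $T = 1$ are assumptions made for illustration.

```python
import math

def energy(z, alpha, beta):
    """Assumed two-state energy: -beta if z1 equals z2, plus -alpha * z1."""
    z1, z2 = z
    return -beta * (1.0 if z1 == z2 else 0.0) - alpha * z1

def metropolis_accept_prob(z, z_prop, alpha, beta, T):
    """Probability that a Metropolis sampler accepts the move from z to z_prop."""
    dh = energy(z_prop, alpha, beta) - energy(z, alpha, beta)
    return 1.0 if dh <= 0 else math.exp(-dh / T)

alpha, beta, T = 1.0, 5.0, 1.0
for target in ((0, 1), (1, 0), (1, 1)):
    print(target, metropolis_accept_prob((0, 0), target, alpha, beta, T))
```

Under these assumptions either single-site flip out of [0 0] is accepted with probability below 0.02, whereas the joint flip to [1 1] lowers the energy and is always accepted.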

In general, there are two possibilities for reducing computational complexity: 

1. Reduce the structure size or correlation length by downsampling; that is, by
creating a tree or hierarchy. Examples include multigrid Monte Carlo [144], and
discrete-state hierarchies [5, 50, 186, 197, 234]. Two illustrations of such a
hierarchy were shown in Chapter 8 in Figure 8.22 (page 285).

2. Reduce the structure size or correlation length by making state elements larger;
that is, by grouping. Examples include Swendsen-Wang [184, 297] and region
grouping [217, 314, 329]; both examples were illustrated in Figure 8.21 (page 284).

The latter approach fits naturally with the Metropolis sampler, since in a proposed
change $Z \rightarrow \bar{Z}$ there is no implication that only one state element changes; rather,
it is perfectly reasonable to have a whole group or clump of state elements change.
In particular, in (11.58) and Figure 11.9, the Metropolis state change [0 0] → [1 1]
would immediately be accepted. However, exactly analogous to the Metropolis
sampling in the continuous-state case in Section 11.3.2, there again remains a challenge
of generating meaningful proposed transitions $Z \rightarrow \bar{Z}$, since only a tiny
fraction of flipped state groupings actually represent likely alternatives.

The former approach, constructing the problem on multiple scales, involves learning
energy functions $H^j$ as a function of scale $j$, but is ultimately more straightforward
than grouped Metropolis since a regular Gibbs sampler can be used. Two general
approaches to hierarchical annealing may be developed, where the distinction lies in
how the coarser scale affects the finer one:

• As the initialization to the finer scale, as outlined in Algorithm 13, and sketched 
in the top sequence in Figure 8.22; 

• As a constraint on the energy function of the finer scale, corresponding to Algo- 
rithm 14, and shown in the bottom sequence in Figure 8.22. 

Algorithm 14 Multilevel Annealing, Constrained by Coarser Scale

Goals: Solve $\hat{z} = \arg_z \min H(z)$ by annealing, each scale constrained by the preceding
Function $\hat{z}$ = HierAnneal(Scales, $\{H^j()\}$, $\{T^j_k\}$)
    $\hat{z}^{\mathrm{Scales}} \leftarrow$ Anneal($H^{\mathrm{Scales}}$, $\{T^{\mathrm{Scales}}_k\}$)        Regularly anneal coarsest scale
    for $j \leftarrow$ Scales $- 1$ to 0 do
        Randomly initialize $z^j_0$
        $\hat{z}^j \leftarrow$ Anneal($H^j(\cdot \mid \hat{z}^{j+1})$, $\{T^j_k\}$, $z^j_0$)        Anneal at scale j
    end for



11.4 Nonparametric Sampling 



Most of this text has focused on model-based methods, where an explicit model is
learned.² However, where the behaviour of a random field is difficult to parametrize,
it may be possible to model the field implicitly in terms of the given training data.

Recall from (2.4) (page 15) that an inverse problem could be solved, in principle, as

$\hat{z} = f^{-1}(m) = \{z \mid f(z) = m\},$ (11.59)

however the number of possible configurations $\{z\}$ is huge and computationally
infeasible to enumerate. Given a number of training samples $\{\bar{z}_i\}$, we can try to solve the
inverse problem with an implicit prior model by limiting to the training samples,

$\hat{z} = \bar{z}_i$ such that $f(\bar{z}_i) = m,$ (11.60)

however it is inconceivable that the correct solution to the inverse problem would
appear, perfectly, in a finite training sample.

We have two difficulties: the number of possible configurations {z}, and the problem 
of existence, whether a z even exists such that f(z) = m. We can address both of 
these problems by dividing z into small patches, similar to the local reduction of 
basis in Section 8.2.3, and by using a least-squares (or other) norm in assessing the 
fit between a local patch and its corresponding measurements. 

Indeed, a wide variety of patch-based sampling methods has recently been developed 
[94,119,209]. 

Let us begin with the simpler problem of prior sampling, most commonly applied
to texture synthesis [94, 209]. Suppose we wish to synthesize a sample, one by one
constructing the pixels of $\hat{Z}$ from $\bar{Z}$,

$\bar{Z}_i \in \mathbb{R}, \qquad \hat{Z}_i \in \{\mathbb{R}, \mathrm{NaN}\},$ (11.61)

where "NaN" (not a number) allows us to distinguish between asserted and
undetermined elements in $\hat{Z}$. We wish to examine $\bar{Z}$ and $\hat{Z}$ patchwise, so let



² With the notable exception of conditional random fields in Section 7.3.

Fig. 11.10. Nonparametric prior sampling: Given a ground-truth image, left, random samples
can be generated by matching patches between ground-truth and sample. Aspects of pattern
"memorization" or copying can be seen in the bottom-right synthesis.

$\mathcal{N}_i, \quad i \in K$ (11.62)

be the region or neighbourhood surrounding location $i$. We can therefore test the
goodness-of-fit between some partial patch of $\hat{Z}$ and any part of $\bar{Z}$:

$\| \bar{Z}_{\mathcal{N}_j}, \hat{Z}_{\mathcal{N}_i} \|.$ (11.63)

With a norm in place, we wish to randomly sample a pixel $\hat{Z}_i$ on the basis of the
patches in $\bar{Z}$ which match the region in $\hat{Z}$ surrounding $i$:

$\hat{Z}_i \sim \{ \bar{Z}_j \mid \| \bar{Z}_{\mathcal{N}_j}, \hat{Z}_{\mathcal{N}_i} \| = 0 \}.$ (11.64)

Because $\bar{Z}$ is of finite size we cannot expect a perfect match, so the norm criterion
needs to be moderated:

$\hat{Z}_i \sim \{ \bar{Z}_j \mid \| \bar{Z}_{\mathcal{N}_j}, \hat{Z}_{\mathcal{N}_i} \| < \epsilon \}.$ (11.65)

$\hat{Z}$ is initialized with a tiny patch of pixels from $\bar{Z}$, after which (11.65) is repeatedly
applied until the entire domain is sampled. The elegance of the approach is that only
$\epsilon$ and the patch size $|\mathcal{N}|$ need to be specified; all other aspects of the model are
nonparametric.
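
The sampling rule can be illustrated with a deliberately simplified Python sketch. It assumes a one-dimensional signal, a causal patch made up of the preceding `patch` values, a sum-of-squares norm, and a fall-back to the best match when nothing lies within the tolerance; all of these choices, and the function name `synthesize`, are assumptions made for illustration rather than the two-dimensional method described above.

```python
import random

def synthesize(training, length, patch, eps, rng):
    """Grow a 1-D sample one element at a time by matching patches in the training data."""
    start = rng.randrange(len(training) - patch + 1)
    sample = list(training[start:start + patch])   # seed with a tiny training patch
    while len(sample) < length:
        context = sample[-patch:]
        candidates = []
        best_dist, best_value = None, None
        for j in range(patch, len(training)):
            window = training[j - patch:j]
            dist = sum((a - b) ** 2 for a, b in zip(window, context))
            if dist <= eps:
                candidates.append(training[j])     # value following a matching patch
            if best_dist is None or dist < best_dist:
                best_dist, best_value = dist, training[j]
        # Sample among the matches, or fall back to the single best match.
        sample.append(rng.choice(candidates) if candidates else best_value)
    return sample

training = [0, 1, 2, 3] * 10
sample = synthesize(training, 20, 2, 0.0, random.Random(1))
print(sample)
```

For this strictly periodic training signal every patch has a unique successor, so the synthesis is deterministic after the seed patch, a one-dimensional analogue of the "memorization" seen with large patches.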

Fig. 11.11. Patch-based Superresolution: It is possible to infer a high-resolution image $\hat{Z}$,
right, from low-resolution measurements, middle, given high-resolution data $\bar{Z}$ which
constrain the permissible patches in $\hat{Z}$.

Two illustrations are shown in Figure 11.10. The synthesized patterns are quite
credible, however there is a clear dependence on patch size. Sizes which are too small
(local) fail to adequately sense the spatial texture, whereas sizes which are too large
will tend to "memorize" $\bar{Z}$, such that only a single patch matches in (11.65), and the
synthesized pattern grows nearly deterministically.

Generalizing the above patch-based method to solving inverse problems and
posterior sampling may be difficult:

1. We cannot assume a dense, regular grid for the measurements $m$.

2. The forward model $f()$ may be nonlocal, a poor fit to the local, nonparametric
model of the patch methods.



There is, however, an important class of problems involving resolution-enhancement 
or superresolution [14, 104], in which we wish to find a high-resolution image based 
on a ground-truth image and low-resolution measurements (for example see Applica- 
tion 11, following this section). Because the low-resolution measurements are dense 
and downsampling is a local operator, the above two difficulties do not apply, and 
nonparametric patch methods have been developed [90, 119] for this case. 

Although the implementation details are somewhat complicated, essentially we now
seek a pixel from those patches simultaneously satisfying the forward problem and
the already-sampled parts of $\hat{Z}$:

$\hat{Z}_i \sim \{ \bar{Z}_j \mid \| \bar{Z}_{\mathcal{N}_j}, \hat{Z}_{\mathcal{N}_i} \| < \epsilon, \; \| m_i, f(\bar{Z}_{\mathcal{N}_j}) \| < \zeta \},$ (11.66)

where $m_i$ is the subset of the measurements which apply to region $\mathcal{N}_i$ around pixel
location $i$. An example illustrating example-based superresolution is shown in
Figure 11.11.



Application 11: Multi-Instrument Fusion of Porous Media [236] 



We wish to reconstruct high-resolution samples of scientific images. Shown below
are two microscopic images of physical samples:

[Images omitted: two microscopic images of physical samples.]

(Microscopic Data from M. Ioannidis, Dept. Chemical Engineering, University of Waterloo)

Producing such samples is difficult and expensive, since a physical sample needs to
be cut open and carefully polished. Even more troubling is that cutting and polishing
may somehow affect the physical sample, leading to distorted images. It would be
far preferable to image a physical sample directly in 3D, for example using MRI. In
such a case there is no distortion or cutting of the sample, however the measurements
are at a far lower resolution:

[Images omitted: low-resolution measurements.]

Magnetic resonance imagers can be configured in a great variety of ways and it is
possible, for example, to measure the diffusivity of water, meaning how tightly the
water molecule is constrained, whether in a small pore (tightly bound, dark) or in a
much larger one (weakly bound, light):

[Image omitted: diffusivity measurements.]

This is essentially a measure of surface-to-volume ratio (or perimeter-to-area in 2D).

The scientists studying such porous materials want a fine-scale image, consistent
with the given measurements. We therefore wish to do posterior sampling, given
a prior model generated from a high-resolution field, coupled with low-resolution
measurements from one or more instruments. The following posterior samples were
generated using a Gibbs sampler with annealing [236]:

[Images omitted: posterior samples.]

The resulting images are visually a significant improvement over the low-resolution
measurements.

Clearly the low-resolution measurements in no way allow us to actually perfectly 
reconstruct the high-resolution images, however the use of a posterior sampler strikes 
a fine balance: 

• Force the resulting image to be consistent with the measurements, where actually 
constrained by the measurements; 

• Where not constrained by the measurements, generate random samples consistent 
with the prior model. 



That is, much of the very fine-scale detail cannot be inferred from the measurements,
but is statistically correct, in the sense that it is consistent with the prior model.

For Further Study



A great deal has been written on Markov Chain Monte Carlo methods, as applied to 
statistical problems in image processing. The books by Li [207] and by Winkler [335] 
are both recommended. 



Sample Problems 

Problem 11.1: Vectorization 

Suppose we have a higher-order model, such as thin-plate, on a regular 2D grid. 
Sketch the pattern of non-interacting state sets, along the lines of Figure 11.8. 

Problem 11.2: MRF Sampling 

Suppose we are given the "Tree-Bark" kernel from Example 6.1 on page 190,
from which we want to generate prior samples. There are four relatively
straightforward approaches for computing a prior sample:

1. By taking the eigendecomposition of $P$ or $P^{-1}$,

2. Using a Cholesky decomposition, as in Section 11.2,

3. Using the FFT, as in Section 11.2.1,

4. Using a marching method and sampling dynamically, as in Section 11.2.2.

Compute $N \times N$ prior samples using each of the above four methods, and discuss
the computational complexity as a function of $4 \le N \le 256$. Report results only
for practical values of $N$; the eigendecomposition, in particular, will quite rapidly
become infeasible.

Problem 11.3: MCMC Sampling 

The Ising model is one of the simplest discrete-state models:

1. Implement a basic Gibbs sampler.

2. Implement a vectorized Gibbs sampler, based on a checkerboard approach.

For both methods we initialize with a random binary field, and then iterate the
Gibbs sampler. With these basic tools in place, do the following:

(a) Quantify the speed difference, per complete iteration (one sampling of every
pixel in the lattice), between the basic and vectorized approaches. Whether
there is a difference, and the magnitude of the difference, will likely be
highly dependent upon the environment (C, Matlab, Octave, etc.).

(b) Run the Gibbs sampler for a few different values of coupling $0 < \beta < 1$.

(c) For highly coupled fields ($\beta \approx 1$), the correct sample is obvious: a constant
field (either all +1 or all -1). Not only is this obvious to us, but the energy
function $H$ also strongly favours the constant field. How rapidly does the
Gibbs sampler approach all constant values?

(d) Explain in some detail why the convergence is so slow in (c), despite the fact
that the energy function strongly desires a constant field.

(e) Suggest approaches for accelerating the convergence in (c).

Problem 11.4: Open-Ended Real-Data Problem, Annealing

Typed text on a page does not obey any simple statistical model, since letters
have rather complex shapes. However, one prior we do have is that each image
pixel is binary: black or white.

Let $Z$ be a binary bitmap of text (either captured from the screen, or scanned
from a page), and let $M_i$ be the $i$th measured image, downsampled from $Z$. The
problem is parametrized by

• The degree of downsampling from $Z$ to $M_i$,

• The number $q$ of measured images $M_i$, $1 \le i \le q$,

• The variance $\sigma^2$ of the added measurement noise.

Use Simulated Annealing to solve for estimates $\hat{Z}$. Your solution will need to
consider the following:

• The annealing schedule $T_k$,

• The choice of prior model (for example from Section 7.4),

• The relative weight, in the Gibbs energy, between prior and measurement.

Begin with a modest degree of downsampling (2 x 2) and a simple prior (Ising).



Part IV 



Appendices 



Algebra 



This appendix contains a brief summary of matrix notation, definitions, identities, 
and common transformations. The following notation will be used throughout: 

$A$            matrix
$a_{ij}$       the $i,j$th element of matrix $A$
$|A|$          matrix determinant
$A^T$          matrix transpose
$A^H$          matrix conjugate transpose
$A^{-1}$       matrix inverse
$A^+$          matrix pseudoinverse
$\mathrm{tr}(A)$   matrix trace
$A_:$, $[A]_:$     $n \times 1$ reordering of matrix to column vector
$\|A\|$        matrix norm

A more comprehensive summary of notation may be found in the Nomenclature 
section on page XVII. 



A.1 Linear Algebra

Our focus in this appendix is on the properties and manipulation of matrices [73, 133,
141, 162, 233, 313]. We can interpret a matrix as a linear operator, transforming from
one space to another:

$y = Cx, \quad C \in \mathbb{R}^{k \times n} \;\Rightarrow\; C : x \in \mathbb{R}^n \rightarrow y \in \mathbb{R}^k,$ (A.1)

thus an understanding of matrices requires an understanding of spaces.

A Linearly Independent set of vectors $\{x_i\}$, $x_i \in \mathbb{R}^n$, is one in which none of the
vectors can be written as a linear function of the others. If it is possible to express
such a relationship, such as

$x_j = \sum_{i \neq j} \alpha_i x_i,$ (A.2)

then the set is said to be linearly dependent. If the set is linearly independent, then
from (A.2) it follows that

$\sum_i \beta_i x_i = 0 \;\Rightarrow\; \beta_i = 0 \;\; \forall i.$ (A.3)



A Subspace is a subset $V$ of a vector space such that two properties hold:

1. Origin: $0 \in V$,

2. Convexity: $\forall x_1, x_2 \in V$, $\alpha x_1 + \beta x_2 \in V \;\; \forall \alpha, \beta \in \mathbb{R}$.

We will concern ourselves only with subspaces of multidimensional real vector
spaces $\mathbb{R}^n$.

A Span is the set of weighted sums induced by a set of vectors:

$\mathrm{Span}(\{x_i\}) = \left\{ \sum_i \alpha_i x_i \right\}$ (A.4)

Any span is a subspace.

A Basis for a subspace is a linearly-independent set of vectors which span the
subspace. The simplest basis elements for the multidimensional real spaces $\mathbb{R}^n$ are
the unit vectors

$e_1 = [1 \; 0 \; \ldots \; 0] \quad \cdots \quad e_n = [0 \; \ldots \; 0 \; 1].$ (A.5)

The Dimension $\dim(V)$ of a subspace $V$ is the number of vectors in a basis for $V$.
Thus $\dim(\mathbb{R}^n) = n$.

The Null Space of a matrix $C$ is the set of elements which $C$ maps to zero:

$\mathrm{Nu}(C) = \{x \mid Cx = 0\}.$ (A.6)

Every null space is a subspace, since

$x_1, x_2 \in \mathrm{Nu}(C) \;\Rightarrow\; C(\alpha x_1 + \beta x_2) = \alpha C x_1 + \beta C x_2 = 0.$ (A.7)

The null space is a measure of the degeneracy of $C$, in the sense of the size of the
subspace which is mapped to a single value:

$Cx = y \;\Rightarrow\; C(x + \alpha \bar{x}) = y \quad \forall \bar{x} \in \mathrm{Nu}(C).$ (A.8)

The Range or column-space of a matrix $C$ is the subspace spanned by its column
vectors. That is, given

$C = [c_1 \; \ldots \; c_n],$ (A.9)

then $\mathrm{Ra}(C)$, the range of $C$, is

$\mathrm{Ra}(C) = \mathrm{Span}(\{c_i\}) = \{Cx \mid x \in \mathbb{R}^n\}.$ (A.10)

The Rank of a matrix $C$ equals the number of linearly independent columns of $C$,
or equivalently the number of linearly independent rows. For a matrix $C \in \mathbb{R}^{k \times n}$,

• $C$ has full rank if $\mathrm{Rank}(C) = \min(k, n)$,

• If $k < n$, we say that $C$ has full row rank if $\mathrm{Rank}(C) = k$,

• If $k > n$, we say that $C$ has full column rank if $\mathrm{Rank}(C) = n$.

Because the rank counts the number of linearly independent columns, it is a
measure of the dimension of the space into which $C$ projects:

$\mathrm{Rank}(C) = \dim(\mathrm{Ra}(C)).$ (A.11)

Similarly, the rank also counts the number of linearly independent rows.
Furthermore $x \in \mathrm{Nu}(C)$ only if $x$ is orthogonal to all of the rows of $C$, leading to the
rank-nullity theorem

$\mathrm{Rank}(C) + \dim(\mathrm{Nu}(C)) = n.$ (A.12)

Therefore if $\mathrm{Nu}(C) = \{0\}$ then $C$ must have full column rank.

Rank can never be increased by matrix multiplication, therefore

$\mathrm{Rank}(AB) \le \min\{\mathrm{Rank}(A), \mathrm{Rank}(B)\}.$ (A.13)

Finally, if $C \in \mathbb{R}^{k \times n}$, $k > n$, has full column rank, then

• $\mathrm{Rank}(C^T C) = n$ and $C^T C$ is a smaller $n \times n$ square, invertible matrix;

• $\mathrm{Rank}(C C^T) = n$ and $C C^T$ is a larger $k \times k$ square, singular matrix.
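
These rank statements can be checked on a small example. The Python sketch below computes rank by Gaussian elimination in exact rational arithmetic, so that no numerical tolerance is needed; the helper names and the example matrix are hypothetical.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    A = [[Fraction(v) for v in row] for row in M]
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if A[i][c] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, rows):
            f = A[i][c] / A[r][c]
            A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        r += 1
        if r == rows:
            break
    return r

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

C = [[1, 0], [0, 1], [1, 1]]              # k = 3, n = 2, full column rank
print(rank(C), rank(matmul(transpose(C), C)), rank(matmul(C, transpose(C))))
```

Here the 2 x 2 product has full rank and is therefore invertible, whereas the 3 x 3 product has rank 2 and is singular.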



A.2 Matrix Operations 



Matrices may be manipulated in many of the same ways as real scalars, but with 
certain exceptions. The transpose of a matrix is the reflection of the matrix across its 
diagonal: 

$B = A^T \;\Rightarrow\; b_{i,j} = a_{j,i}.$ (A.14)

Thus the transpose of a column vector is a row vector. The sum of two matrices $A$, $B$
is an element-by-element sum:

$C = A + B \;\Rightarrow\; c_{i,j} = a_{i,j} + b_{i,j},$ (A.15)

where $A$ and $B$ must be of the same size. The product of two matrices is more
complicated,

$C = A \cdot B \;\Rightarrow\; c_{i,k} = \sum_j a_{i,j}\, b_{j,k},$ (A.16)

where there is an implied condition on the matrix sizes that the number of columns
in $A$ must equal the number of rows in $B$. In general, matrix multiplication does not
commute, meaning that $A \cdot B \neq B \cdot A$. We will also have use for element-by-element
operations

$C = A \odot B \;\Rightarrow\; c_{ij} = a_{ij} \cdot b_{ij}$
$D = A \oslash B \;\Rightarrow\; d_{ij} = a_{ij} / b_{ij}.$ (A.17)

The inverse of square matrix $A$ can be defined, such that

$B = A^{-1} \;\Rightarrow\; A \cdot B = B \cdot A = I = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}.$ (A.18)



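All non-square and many square matrices do not have an inverse, although more
general notions of matrix inverses can be defined, as discussed in Section A.9.

For an $n \times n$ square matrix $A$, invertibility is determined by the matrix determinant

$\det(A) = |A| = \prod_i \lambda_i,$ (A.19)

where the $\lambda_i$ are the eigenvalues of matrix $A$, as discussed in Section A.7.1. Then

$A$ singular $\Leftrightarrow$ $A^{-1}$ does not exist        $A$ nonsingular $\Leftrightarrow$ $A^{-1}$ exists        (A.20)
$\Leftrightarrow |A| = 0$                                    $\Leftrightarrow |A| \neq 0$                    (A.21)
$\Leftrightarrow \exists i \ni \lambda_i = 0$                $\Leftrightarrow \lambda_i \neq 0 \;\; \forall i$    (A.22)
$\Leftrightarrow \mathrm{Nu}(A) \neq \{0\}$                  $\Leftrightarrow \mathrm{Nu}(A) = \{0\}$        (A.23)
$\Leftrightarrow \mathrm{Rank}(A) < n$                       $\Leftrightarrow \mathrm{Rank}(A) = n$          (A.24)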

Various algorithms are known [141, 162] for computing matrix determinants. It is 
important to realize, however, that the numerical determination of matrix singularity 
is very difficult because an infinitesimal perturbation of a singular matrix will make 
it nonsingular. That is, numerical rounding errors make the precise computation of 
a determinant difficult, and it is therefore similarly difficult to determine singularity
based on a comparison of the computed determinant with zero.

Related to the determinant is the matrix trace. For an $n \times n$ square matrix A, the trace is

$$\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii} = \sum_i \lambda_i. \tag{A.25}$$

Because trace and expectation (Appendix B.1.1) are both linear they commute, allowing a trace to simplify certain expectation expressions.



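Matrix Addition / Multiplication Identities:

$$A + B = B + A \tag{A.26}$$

$$(A + B)^T = A^T + B^T \qquad (AB)^T = B^T A^T \tag{A.27}$$

Matrix Trace Identities:

$$\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B) \qquad \operatorname{tr}(kA) = k \operatorname{tr}(A) \tag{A.28}$$

$$\operatorname{tr}(AB) = \operatorname{tr}(BA) \qquad \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \tag{A.29}$$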



Matrix Determinant Identities:

$$|AB| = |A| \cdot |B| \qquad |A^{-1}| = \frac{1}{|A|} \tag{A.30}$$

$$|kA| = k^n |A| \qquad \det(I - AB) = \det(I - BA) \tag{A.31}$$



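Block-Matrix Determinants:

$$\det \begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc \qquad \det \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} = |A| \cdot |D| \tag{A.32}$$

$$\det \begin{bmatrix} A & B \\ C & D \end{bmatrix} = |D| \cdot |A - B D^{-1} C| = |A| \cdot |D - C A^{-1} B| \tag{A.33}$$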



Matrix Inversion Identities:

$$(AB)^{-1} = B^{-1} A^{-1} \qquad (A^{-1})^T = (A^T)^{-1} \tag{A.34}$$

$$(I - AB)^{-1} A = A (I - BA)^{-1} \tag{A.35}$$
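Block-Matrix Inversions:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{A.36}$$

$$\begin{bmatrix} A & B \\ 0 & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & -A^{-1} B D^{-1} \\ 0 & D^{-1} \end{bmatrix} \tag{A.37}$$

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1} B S_A^{-1} C A^{-1} & -A^{-1} B S_A^{-1} \\ -S_A^{-1} C A^{-1} & S_A^{-1} \end{bmatrix} = \begin{bmatrix} S_D^{-1} & -S_D^{-1} B D^{-1} \\ -D^{-1} C S_D^{-1} & D^{-1} + D^{-1} C S_D^{-1} B D^{-1} \end{bmatrix} \tag{A.38}$$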



where $S_A = (D - C A^{-1} B)$, $S_D = (A - B D^{-1} C)$. Comparing the last two equivalent forms for block-matrix inversion leads to the ABCD lemma

$$(A + BCD)^{-1} = A^{-1} - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1}. \tag{A.39}$$
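As a quick numerical sanity check of the ABCD lemma (A.39), the following minimal sketch compares a direct inverse against the right-hand side of the lemma. The matrix sizes, the random seed, and all variable names are illustrative assumptions, not taken from the text; NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2

# Illustrative test matrices (assumed, not from the text).
A = np.diag(rng.uniform(1.0, 2.0, n))   # diagonal, hence trivially invertible
B = rng.standard_normal((n, k))
C = np.eye(k)
D = B.T                                  # makes A + BCD symmetric positive-definite

A_inv = np.diag(1.0 / np.diag(A))

# Left-hand side of (A.39): invert the full n x n matrix directly.
direct = np.linalg.inv(A + B @ C @ D)

# Right-hand side of (A.39): only a k x k matrix needs a general inverse.
inner = np.linalg.inv(np.linalg.inv(C) + D @ A_inv @ B)
lemma = A_inv - A_inv @ B @ inner @ D @ A_inv

max_err = np.max(np.abs(direct - lemma))
```

The appeal of the lemma is visible in the sizes: when A is easy to invert and k is much smaller than n, the right-hand side replaces an n × n inversion with a k × k one.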



A.3 Matrix Positivity 



For scalar values, notions of positivity, negativity, and relative size are unambiguous. 
For example, 

$$-1 < 3.5 > 2.14 < 4.8. \tag{A.40}$$

However, for matrices notions of inequality and positivity are much less clear; matrix 
element-by-element comparisons are, in most cases, not very useful. Instead, we say 
that a matrix A is 



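Positive definite        $A > 0$      if $x^T A x > 0 \;\; \forall x \neq 0$
Positive semidefinite    $A \geq 0$   if $x^T A x \geq 0 \;\; \forall x$
Negative semidefinite    $A \leq 0$   if $x^T A x \leq 0 \;\; \forall x$
Negative definite        $A < 0$      if $x^T A x < 0 \;\; \forall x \neq 0$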

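Given the eigendecomposition (Appendix A.7.1) for A

$$A v_i = \lambda_i v_i \quad\text{or}\quad AV = V\Lambda \tag{A.41}$$

then

$$x^T A x = x^T V \Lambda V^T x = (V^T x)^T \Lambda (V^T x) = z^T \Lambda z = \sum_i \lambda_i z_i^2. \tag{A.42}$$

Therefore the positivity of A is directly related to the signs of the eigenvalues of A:

$$A > 0 \iff \lambda_i > 0 \;\; \forall i \qquad A \geq 0 \iff \lambda_i \geq 0 \;\; \forall i. \tag{A.43}$$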
Matrix inequalities can then be interpreted as the positive definiteness of a matrix difference:

$$A > B \iff (A - B) > 0 \qquad C \leq D \iff (C - D) \leq 0. \tag{A.44}$$



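Positive definiteness identities:

$$A > 0 \iff -A < 0 \qquad A \geq 0 \iff -A \leq 0 \tag{A.45}$$

$$A > 0,\, B > 0 \implies A + B > 0 \qquad A \geq 0,\, B \geq 0 \implies A + B \geq 0 \tag{A.46}$$

$$A > 0 \implies A^{-1} > 0 \qquad A < 0 \implies A^{-1} < 0 \tag{A.47}$$

$$A \geq 0 \implies B^T A B \geq 0 \;\; \forall B \qquad A > 0 \implies B^T A B > 0 \;\; \forall \text{ invertible } B \tag{A.48}$$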

Computing all of the eigenvalues of A is one (computationally demanding) way to test its positive-definiteness. One other method is Sylvester's test [162, 233]:

Given a symmetric $n \times n$ matrix A, define

$$d_i = \det(A_i) \quad\text{where}\quad A_i = \begin{bmatrix} a_{11} & \cdots & a_{1i} \\ \vdots & & \vdots \\ a_{i1} & \cdots & a_{ii} \end{bmatrix}. \tag{A.49}$$

Then A is positive-definite if and only if $d_i > 0$ for all $i = 1, \ldots, n$.
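The two tests described above (eigenvalues and Sylvester's leading minors) can be sketched in a few lines. This is a minimal illustration assuming NumPy and a symmetric input; the function names and test matrices are hypothetical. A third check, attempting a Cholesky factorization (Appendix A.7.3), is included because it is a common practical alternative, not because the text prescribes it.

```python
import numpy as np

def is_pd_eig(A):
    """Eigenvalue test (A.43): all eigenvalues of symmetric A are positive."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

def is_pd_sylvester(A):
    """Sylvester's test (A.49): every leading principal minor d_i is positive."""
    n = A.shape[0]
    return all(np.linalg.det(A[:i, :i]) > 0 for i in range(1, n + 1))

def is_pd_cholesky(A):
    """Practical alternative: the factorization fails unless A is positive-definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

# Illustrative matrices (assumed): one positive-definite, one indefinite.
A_pd = np.array([[2.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 2.0]])
A_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])
```

All three tests share the numerical caveat raised earlier for determinants: for matrices close to singular, rounding errors can flip the answer.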

A.4 Matrix Positivity of Covariances 



All covariance matrices (Section B.4) must satisfy $A \geq 0$. The reason for this constraint, and a parallel interpretation of the meaning of matrix positivity and inequalities, is most easily seen in the context of random vectors (Appendix B.1.3). First, given a random vector z with covariance A, we can calculate the variance of any scalar linear function of z:

$$\operatorname{var}(w^T z) = w^T A w. \tag{A.50}$$

A variance must, by definition, be non-negative, therefore we have

$$w^T A w \geq 0 \;\; \forall w \implies A \geq 0. \tag{A.51}$$

That is, all covariances must, by definition, be positive-semidefinite.

Next, let us consider a matrix inequality, in which one matrix is "larger" than the 
other. Suppose we have two random vectors with corresponding covariances: 

$$x \sim A_x \qquad y \sim A_y \qquad \text{where } A_x > A_y. \tag{A.52}$$
[Figure A.1: plot of the valid range of $\rho_{a,c}$; horizontal axis: Correlation Coefficient $\rho_{a,b}$, $\rho_{b,c}$]

Fig. A.1. The shaded area shows the range of valid correlation coefficients between random variables a, c, given the strength of the relationship with intermediate variable b. If a, b and b, c are uncorrelated ($\rho_{a,b} = \rho_{b,c} = 0$), then the relationship between a, c is completely unconstrained. As a, b and b, c become more strongly correlated ($|\rho_{a,b}| = |\rho_{b,c}| \to 1$), the range of valid correlation between a, c becomes increasingly constrained.



The matrix inequality in the covariances implies that

$$w^T A_x w > w^T A_y w \implies \operatorname{var}(w^T x) > \operatorname{var}(w^T y). \tag{A.53}$$

That is, a "larger" covariance implies a larger variance in any linear function of the corresponding random vector.

Finally, the positive-definiteness requirement for a covariance can also be interpreted intuitively. Suppose I have three random variables a, b, c, where the ab and bc correlations are known; what does this imply about the ac correlation?



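$$a \;\overset{\text{correl.}}{\longleftrightarrow}\; b \qquad b \;\overset{\text{correl.}}{\longleftrightarrow}\; c \qquad\Longrightarrow\qquad a \;\overset{???}{\longleftrightarrow}\; c \tag{A.54}$$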



We can ask how $\rho_{a,c}$, the correlation coefficient between a and c, is constrained by the given correlation coefficients $\rho_{a,b}$, $\rho_{b,c}$. If a, b, c have unit variance, then

$$\operatorname{cov}\left( \begin{bmatrix} a \\ b \\ c \end{bmatrix} \right) = \begin{bmatrix} 1 & \rho_{a,b} & ? \\ \rho_{a,b} & 1 & \rho_{b,c} \\ ? & \rho_{b,c} & 1 \end{bmatrix} \tag{A.55}$$

As the $a \leftrightarrow b$ and $b \leftrightarrow c$ relationships become stronger (i.e., as $|\rho| \to 1$), the relationship between a and c becomes increasingly constrained. Figure A.1 plots the permitted range of values for $\rho_{a,c}$ as a function of $\rho_{a,b} = \rho_{b,c}$. The range of valid values for $\rho_{a,c}$ are precisely those which preserve the positive-definiteness of the 3 × 3 covariance in (A.55), and the filling in of such unknown values is called the covariance extension problem [48].
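The permitted range of the unknown correlation can be worked out by requiring the determinant of the 3 × 3 unit-variance covariance to be non-negative, which gives a quadratic in $\rho_{a,c}$ with roots $\rho_{a,b}\rho_{b,c} \pm \sqrt{(1-\rho_{a,b}^2)(1-\rho_{b,c}^2)}$. The sketch below assumes NumPy; the function names are hypothetical. It evaluates this interval and confirms it against the smallest eigenvalue of the covariance.

```python
import numpy as np

def rho_ac_range(rho_ab, rho_bc):
    """Interval of rho_ac keeping the 3x3 unit-variance covariance positive-semidefinite."""
    centre = rho_ab * rho_bc
    half_width = np.sqrt((1.0 - rho_ab**2) * (1.0 - rho_bc**2))
    return centre - half_width, centre + half_width

def min_eigenvalue(rho_ab, rho_bc, rho_ac):
    """Smallest eigenvalue of the covariance with the unknown entry filled in."""
    C = np.array([[1.0, rho_ab, rho_ac],
                  [rho_ab, 1.0, rho_bc],
                  [rho_ac, rho_bc, 1.0]])
    return np.linalg.eigvalsh(C).min()

# With rho_ab = rho_bc = 0.9 the interval shrinks to roughly [0.62, 1.0];
# with rho_ab = rho_bc = 0 it is the full range [-1, 1].
lo, hi = rho_ac_range(0.9, 0.9)
```

Values of $\rho_{a,c}$ outside the returned interval produce a negative eigenvalue, that is, an invalid covariance.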
A.5 Matrix Types

There exists a wide variety of terms associated with matrices and their properties [141, 162, 163]; the most commonly used are summarized in Table A.1. Except where stated otherwise, the terms all relate to square matrices of size $n \times n$. Those definitions which admit a particularly simple visual interpretation are also sketched in Figure A.2.



A.6 Matrix / Vector Derivatives 



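It is possible to define derivatives of a scalar function $f()$, vector function $\underline{f}()$, or matrix function $F()$ with respect to a scalar $x$, vector $\underline{x}$, or matrix $X$. In particular, there are five cases of interest:

Vector derivative of a scalar function or vice-versa:

$$\text{I.}\quad \frac{\partial f}{\partial \underline{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \qquad\qquad \text{II.}\quad \frac{\partial \underline{f}}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x} \\ \vdots \\ \frac{\partial f_k}{\partial x} \end{bmatrix}$$

Derivatives of or by a matrix:

$$\text{III.}\quad \frac{\partial f}{\partial X} = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\ \vdots & & \vdots \\ \frac{\partial f}{\partial x_{k1}} & \cdots & \frac{\partial f}{\partial x_{kn}} \end{bmatrix} \qquad\qquad \text{IV.}\quad \frac{\partial F}{\partial x} = \begin{bmatrix} \frac{\partial f_{11}}{\partial x} & \cdots & \frac{\partial f_{1n}}{\partial x} \\ \vdots & & \vdots \\ \frac{\partial f_{k1}}{\partial x} & \cdots & \frac{\partial f_{kn}}{\partial x} \end{bmatrix} \tag{A.56}$$

Vector derivative of a vector function:

$$\text{V.}\quad \frac{\partial \underline{f}}{\partial \underline{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_k}{\partial x_1} & \cdots & \frac{\partial f_k}{\partial x_n} \end{bmatrix}$$

where we have assumed vectors $\underline{f}$, $\underline{x}$ to be column vectors.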

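All of these definitions are straightforward, with the exception of case V where it is ambiguous whether the elements of $\underline{f}$ index the columns or the rows. In particular, the transpose operation commutes with differentiation

$$\frac{\partial f}{\partial \underline{x}^T} = \left( \frac{\partial f}{\partial \underline{x}} \right)^T \qquad \frac{\partial \underline{f}^T}{\partial x} = \left( \frac{\partial \underline{f}}{\partial x} \right)^T \qquad \frac{\partial f}{\partial X^T} = \left( \frac{\partial f}{\partial X} \right)^T \qquad \frac{\partial F^T}{\partial x} = \left( \frac{\partial F}{\partial x} \right)^T \tag{A.57}$$

in all cases except case V. As a further complication, there is no universal agreement on row-column conventions for cases I and V, and unfortunately different authors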
Term                     Definition / Comments

* Diagonal               $a_{ij} = 0,\; i \neq j$
* Upper Triangular       $a_{ij} = 0,\; i > j$ (Strictly Upper for $i \geq j$)
* Lower Triangular       $a_{ij} = 0,\; i < j$ (Strictly Lower for $i \leq j$)
* Symmetric              $a_{ij} = a_{ji}$ or $A = A^T$
* Skew-Symmetric         $a_{ij} = -a_{ji}$ or $A = -A^T$
* Hermitian              $a_{ij} = a_{ji}^*$ or $A = A^H$
  Singular               $|A| = 0$
  Special                $|A| = 1$
  Normal                 $A A^H = A^H A$
  Orthogonal             $A A^T = A^T A = I$
  Unitary                $A A^H = A^H A = I$
* Circulant              $a_{ij} = a_{(i+k) \bmod n,\,(j+k) \bmod n} \;\; \forall k$
  Diagonalizable         $B A B^{-1}$ is diagonal for some B
  Similar                A similar to C if $C = B A B^{-1}$ for some B
  Unitarily Similar      A unitarily similar to C if $C = B A B^{-1}$, unitary B
  Positive-Definite      $x^H A x > 0$ for all $x \neq 0$
  Positive-Semidefinite  $x^H A x \geq 0$ for all x
  Idempotent             $AA = A$
  Nilpotent              $A^k = 0$ for some positive integer k
  Involutory             $AA = I$
* Stochastic             $\sum_j a_{ij} = 1$
* Selection              Each row has one 1.0, the rest zeros
* Permutation            Each row and column has one 1.0, the rest zeros
* Toeplitz               A matrix with constant diagonals
* Hankel                 A matrix with constant anti-diagonals
* Hadamard               A binary matrix $a_{ij} = \pm 1$, where $A A^T = A^T A = nI$
* Hilbert                $a_{ij} = 1/(i + j - 1)$
* Pascal                 An integer matrix with integer inverse
* Vandermonde            Each row is a geometric sequence: $a_{ij} = \alpha_i^{\,n-j}$
  Metzler                $a_{ij} \geq 0$ for $i \neq j$
  Householder            Any $n \times n$ matrix $(I - 2 v v^T / (v^T v))$, v nonzero
* Jordan                 A block diagonal matrix with Jordan blocks
  Jordan Block           $a_{ij} = 0$ except $a_{ii} = \alpha$, $a_{i,i+1} = 1$
* Wronskian              The zeroth to $(n-1)$st derivatives of a vector function
* Jacobian               The vector derivative of a vector function
* Hessian                The second vector derivative of a scalar function

Table A.1. Common named matrix types (starred entries sketched in Figure A.2)
[Figure A.2: example matrices for the types Diagonal, Upper Triangular, Lower Triangular, Symmetric, Skew Symmetric, Hermitian, Stochastic, Selection, Permutation, Circulant, Toeplitz, Hankel, Hadamard, Vandermonde, Wronskian, Hilbert, Givens, Pascal, Jordan, Jacobian, and Hessian]

Fig. A.2. Examples of commonly used matrix types from Table A.1. A blank area in a matrix implies zeros. The figure lists only illustrative examples.
Case   Derivatives of Linear Functions

V      […]
I      […]
V      $\frac{\partial}{\partial x} A x = A$
V      […]
III    $\frac{\partial}{\partial X} (a^T X b) = a b^T$

Case   Derivatives of Quadratic Functions

I      $\frac{\partial}{\partial x} (x^T x) = 2x$
I      $\frac{\partial}{\partial x} (x^T A x) = (A + A^T)\, x$
I      $\frac{\partial}{\partial x} (Ax + b)^T Q (Ax + b) = 2 A^T Q (Ax + b)$
III    $\frac{\partial}{\partial X} (a^T X^T X b) = X (a b^T + b a^T)$
III    $\frac{\partial}{\partial X} (a^T X^T X a) = 2 X a a^T$
III    $\frac{\partial}{\partial X} (a^T X^T C X b) = C^T X a b^T + C X b a^T$

Table A.2. Vector and matrix derivatives of linear and quadratic functions [46, 257]



may use opposing conventions, thus the derivatives presented here may, in some cases, need to be transposed to fit with other conventions. A summary of useful derivatives is shown in Tables A.2 and A.3.
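Identities of this kind are easy to get wrong by a transpose, so a finite-difference check is a useful habit. The following minimal sketch, assuming NumPy and the column-vector convention stated for (A.56), compares two tabulated derivatives against central differences. The helper name `numerical_gradient` and the random test data are hypothetical.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference derivative of scalar f with respect to array x (cases I and III)."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = h
        g[idx] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = rng.standard_normal((3, 4))

# Case I: d/dx (x^T A x), written as a column vector, is (A + A^T) x.
err_quadratic = np.max(np.abs(
    numerical_gradient(lambda v: v @ A @ v, x) - (A + A.T) @ x))

# Case III: d/dX (a^T X b) = a b^T.
err_bilinear = np.max(np.abs(
    numerical_gradient(lambda M: a @ M @ b, X) - np.outer(a, b)))
```

Under a row-vector convention for case I the first result would appear transposed, which is exactly the ambiguity the text warns about.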

There are three matrices commonly connected with derivatives: the Jacobian, the 
Wronskian, and the Hessian, all of which are illustrated in Figure A.2: 

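Jacobian: If $\underline{y}$ is a vector function of $\underline{x}$, then the derivative $\partial \underline{y} / \partial \underline{x}$ is the Jacobian matrix of $\underline{y}$ with respect to $\underline{x}$. The determinant $|\partial \underline{y} / \partial \underline{x}|$ is also referred to as the Jacobian, essentially a normalizing constant in the change of variables from $\underline{x}$ to $\underline{y}$.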
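Case   Derivatives of Matrix Inverses

IV     $\frac{\partial}{\partial x} Y^{-1} = -Y^{-1} \frac{\partial Y}{\partial x} Y^{-1}$
III    $\frac{\partial}{\partial X} (a^T X^{-1} b) = -X^{-T} a b^T X^{-T}$

Case   Derivatives of Matrix Determinants

III    $\frac{\partial}{\partial X} |X| = |X| \cdot X^{-T}$
III    $\frac{\partial}{\partial X} |AXB| = |AXB| \cdot X^{-T}$
III    $\frac{\partial}{\partial X} \ln(|AXB|) = A^T (AXB)^{-T} B^T = X^{-T}$

Case   Derivatives of Matrix Traces

III    $\frac{\partial}{\partial X} \operatorname{tr}(X) = \frac{\partial}{\partial X} \operatorname{tr}(X^T) = I$
III    $\frac{\partial}{\partial X} \operatorname{tr}(XA) = \frac{\partial}{\partial X} \operatorname{tr}(AX) = A^T$
III    $\frac{\partial}{\partial X} \operatorname{tr}(AXB) = A^T B^T$
III    $\frac{\partial}{\partial X} \operatorname{tr}(X A X^T) = X (A + A^T)$
III    $\frac{\partial}{\partial X} \operatorname{tr}(X^{-1}) = -X^{-T} X^{-T}$

Table A.3. Vector and matrix derivatives of matrix inverses, determinants, and traces [46, 257]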



Wronskian: If $\underline{y}$ is a vector function of scalar t, then the matrix built by assembling the rows $\underline{y}^{(0)}(t) \ldots \underline{y}^{(n-1)}(t)$ is known as a Wronskian of $\underline{y}$.

Hessian: If $f$ is a scalar function of vector $\underline{x}$, then the second-derivative matrix $\partial/\partial \underline{x}\,(\partial f/\partial \underline{x})$ is known as the Hessian matrix of $f(\underline{x})$. The positive-definiteness of the Hessian relates to the extremal properties (minimum, maximum, or saddle point) of the function.



A.7 Matrix Transformations 



A wide variety of matrix transformations exist [141], many of which are used in
solving linear systems, normal equations, or least-squares problems, all of which
Transformation    Matrix Assumptions    Purpose of Transformation

Eigendecomp.      Square                Matrix diagonalization
SVD               None                  Matrix orthogonalization
Cholesky          Positive-Definite     Square roots, Linear systems, Least squares
Gauss             Any                   Matrix element zeroing
Gauss Elimin.     Nonsingular           Linear systems, Matrix inversion
LU                Square                Linear systems, Matrix inversion, Matrix representation
Gram-Schmidt      Any                   Vector orthogonalization
Householder       Any                   Matrix element zeroing
Givens            Any                   Matrix element zeroing
QR                Full-column rank      Representation, Least squares
Schur             Square                Representation

Table A.4. An overview of the matrix transformations discussed in Section A.7



are of interest in this book, particularly in Chapters 8 and 9. The transformations 
discussed in this section are listed in Table A.4. 



A.7.1 Eigendecompositions 



The eigendecomposition is one of the most fundamental and powerful matrix tools 
in all of linear algebra. 

For any square matrix A, an eigenvector v and corresponding eigenvalue $\lambda$ must satisfy

$$A v = v \lambda. \tag{A.58}$$

That is, the eigenvectors point in those special directions which are invariant to the repeated application of linear operator A, greatly simplifying analysis:

$$\underbrace{A \cdots A}_{q \text{ times}}\, v = A^q v = A^{q-1} (v \lambda) = v \lambda^q. \tag{A.59}$$

Since the eigenvalues determine many of the properties of a matrix, the distribution of eigenvalues $\{\lambda_i\}$ is known as the spectrum of a matrix.
For an $n \times n$ matrix A there are n eigenvalues and eigenvectors, although for certain matrices there can be redundancy (multiplicity) among them. However, if A is real and symmetric (such as a covariance), then the n eigenvalues are real and the n eigenvectors are linearly independent and orthogonal. That is, the eigenvectors form a basis for $\mathbb{R}^n$, allowing any n-dimensional vector x to be expressed in terms of the eigenvectors:

$$x = \sum_{i=1}^{n} \alpha_i v_i, \tag{A.60}$$

where the orthogonality of the eigenvectors allows the weights to be easily calculated:

$$\alpha_i = \frac{v_i^T x}{v_i^T v_i}. \tag{A.61}$$

The transformation (A.60) simplifies the analysis for linear operations on any vector:

$$\underbrace{A \cdots A}_{q \text{ times}}\, x = A^q \sum_{i=1}^{n} \alpha_i v_i = \sum_{i=1}^{n} \alpha_i \lambda_i^q v_i. \tag{A.62}$$



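Normalizing the eigenvectors to unit length, we can write the n eigenvectors and eigenvalues in matrix form

$$V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} \tag{A.63}$$

such that $\Lambda$ is a diagonal matrix and V is an orthogonal matrix

$$V^T V = V V^T = I. \tag{A.64}$$

With this notation the eigendecomposition (A.58) can be rewritten as

$$A v_i = v_i \lambda_i \implies A V = V \Lambda, \tag{A.65}$$

from which it is particularly easy to derive the similarity transformation for matrix diagonalization and related expressions:

$$V^T A V = \Lambda \qquad A = V \Lambda V^T \qquad A^{-1} = V \Lambda^{-1} V^T \qquad A^q = V \Lambda^q V^T. \tag{A.66}$$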
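The relations in (A.64) and (A.66) can be checked numerically for a real symmetric matrix. This is a minimal sketch assuming NumPy, whose `numpy.linalg.eigh` returns eigenvalues in ascending order together with orthonormal eigenvectors as columns; the example matrix and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative symmetric positive-definite matrix (assumed).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, V = np.linalg.eigh(A)      # columns of V are unit-length eigenvectors

ok_orthogonal = np.allclose(V.T @ V, np.eye(3))                  # (A.64)
ok_diagonal = np.allclose(V.T @ A @ V, np.diag(lam))             # V^T A V = Lambda
ok_inverse = np.allclose(V @ np.diag(1.0 / lam) @ V.T,
                         np.linalg.inv(A))                       # A^{-1}
q = 5
ok_power = np.allclose(V @ np.diag(lam**q) @ V.T,
                       np.linalg.matrix_power(A, q))             # A^q
```

Once V and the eigenvalues are known, inverses and powers reduce to scalar operations on the diagonal entries.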



Fundamentally, eigendecompositions are about taking a coupled linear system and 
decoupling it into individual modes (the eigenvectors) which evolve independently, 
and where the associated eigenvalue describes the behaviour of the associated mode. 
For example, suppose we have a mechanical system of n masses connected by 
springs: 

$$\ddot{x}(t) = K x(t). \tag{A.67}$$

This is just a rewriting of Newton's F = ma, where the acceleration ($\ddot{x}$) is written in terms of force/mass ($Kx$). By finding the eigendecomposition of K,

$$K = V_K \Lambda_K V_K^T, \tag{A.68}$$

the coupled problem of (A.67) can be transformed

$$y(t) = V_K^T x(t) \implies \ddot{y}_i(t) = \lambda_i y_i(t), \tag{A.69}$$

a set of n single-spring/mass systems whose solution is simple:

$$y_i(t) = y_i(0) \cos\left( t \sqrt{-\lambda_i} \right). \tag{A.70}$$

Further illustrations are given below for dynamic and covariance matrices.

The significance and power of the eigendecomposition should make it no surprise that there are many related concepts and extensions. Related concepts include the Cholesky and QR decompositions (Appendix A.7.3), positive definiteness (Appendix A.3), and matrix square roots (Appendix A.8).

Extensions include the Jordan form for non-diagonalizable matrices (not discussed here), the Schur decomposition for complex matrices (Appendix A.7.3), the singular value decomposition for rectangular matrices (Appendix A.7.2), and generalized eigendecompositions, in which we seek solutions for v, $\lambda$ to

$$A v = B v \lambda, \tag{A.71}$$

where normally B is not invertible, a context which can arise in singular estimation problems, but which does not arise in this book.

Eigendecomposition and Dynamic Matrices 

If $n \times n$ matrix A is a dynamic matrix, describing the evolution of x as

$$x(t+1) = A x(t), \tag{A.72}$$

then the stability of the iteration is determined by the eigenvalues of A. Specifically, the spectral radius, the largest eigenvalue magnitude

$$\rho(A) = \max_i |\lambda_i|, \tag{A.73}$$

satisfies $\rho(A) < 1$ for stable systems and $\rho(A) > 1$ for unstable ones.

Given an initial condition x(0), then $x(t) = A^t x(0)$ is an iterative calculation, with a complexity increasing with t. Instead, the eigendecomposition $A = V_A \Lambda_A V_A^T$ allows the coupled dynamics of (A.72) to be decoupled into individual modes, based on the eigenvectors, which evolve independently, thus

$$\begin{array}{ccc} x(0) & \xrightarrow{\;V^T\;} & y(0) = V^T x(0) \\ & & \downarrow \\ x(t) & \xleftarrow{\;\;V\;\;} & y(t) = \Lambda^t y(0) \end{array} \tag{A.74}$$

leads to a closed-form solution $x(t) = V \Lambda^t V^T x(0)$, where the complexity of computing $\Lambda^t$ is fixed, since $\Lambda$ is diagonal.
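To make the decoupling concrete, the sketch below compares the step-by-step iteration of (A.72) with the closed-form modal solution, and computes the spectral radius of (A.73). It assumes NumPy and a symmetric dynamics matrix, so that an orthogonal eigendecomposition exists; the matrix, initial condition, and horizon are illustrative assumptions.

```python
import numpy as np

# Illustrative symmetric dynamics matrix with eigenvalues 0.4 and 0.9 (assumed).
A = np.array([[0.5, 0.2],
              [0.2, 0.8]])
x0 = np.array([1.0, -1.0])
t = 20

lam, V = np.linalg.eigh(A)
rho = np.max(np.abs(lam))          # spectral radius (A.73); < 1 means stable

# Iterative evaluation: t matrix-vector products.
x_iter = x0.copy()
for _ in range(t):
    x_iter = A @ x_iter

# Closed form: transform to modes, scale each mode by lambda_i^t, transform back.
y0 = V.T @ x0
x_closed = V @ (lam**t * y0)
```

The cost of the closed form does not grow with t, since raising a diagonal matrix to a power only requires n scalar powers.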

If the dynamics matrix describes the evolution of an error,

$$e(t+1) = Q e(t) \tag{A.75}$$

such as in the iterative linear-system solvers of Chapter 9, then we are interested in the rate at which the error decays to zero. Because

$$e(t) = Q^t e(0) \tag{A.76}$$

from (A.62) we know that

$$e(t) = Q^t e(0) = \sum_{i=1}^{n} \left( v_i^T e(0) \right) \lambda_i^t\, v_i. \tag{A.77}$$

That is, each eigenvector $v_i$ describes the shape or form of the error which decays at a rate controlled by $\lambda_i$.



Eigendecomposition and Covariance Matrices 

In the context of this book we are frequently concerned with the interrelationship of variables in a random vector, a set of relationships described by a covariance matrix (Appendix B.1.3). Given a set of n coupled random variables, the eigentransformation decouples them:

$$x \sim P \implies y = V_P^T x \sim \Lambda_P. \tag{A.78}$$

If the joint distribution p(x) is Gaussian (Appendix B.3), then the eigenvectors of P point along the principal axes of the ellipsoid characterizing the multivariate distribution, and the eigenvalues represent the variances along those directions.

The spectrum, or distribution of eigenvalues, is useful in at least two ways: 

• Even for non-Gaussian distributions, the relative sizes of the eigenvalues of a 
covariance give an indication of the degree to which a multivariate distribution is 
constrained in various directions. 

• The conditioning of matrix P, which strongly affects numerical stability and rates of iterative convergence, is a function of the largest and smallest eigenvalues. As the smallest eigenvalue approaches zero, ever-smaller numeric perturbations lead to matrix corruption and a failure of positive-definiteness.
A.7.2 Singular Value Decomposition

Given any real matrix A, the Singular Value Decomposition (SVD) expresses A as

$$A = U S V^T, \tag{A.79}$$

where U, V are orthogonal matrices and where S is a diagonal matrix, the same dimensions as A, containing the singular values along the diagonal:

$$S = \begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & \ddots \\ & 0 & \end{bmatrix} \quad\text{or}\quad S = \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & 0 \\ & & \ddots & \end{bmatrix} \tag{A.80}$$

where, by convention, the singular values are ordered $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.

For real symmetric matrices the SVD and eigendecomposition are nearly equivalent. Representing a covariance A in both ways,

$$\underbrace{A = U S V^T}_{\text{SVD}} \quad\text{and}\quad \underbrace{A = Q \Lambda Q^T}_{\text{Eig.}} \tag{A.81}$$

by equating the SVD and eigendecomposition expressions we find that

$$q_i = v_i \qquad \sigma_i = |\lambda_i| \qquad u_i = \operatorname{sign}(\lambda_i)\, v_i. \tag{A.82}$$

That is, requiring the singular values to be non-negative causes the sign of the eigenvalue to be absorbed into one of the orthogonal matrices U, V.

For real nonsymmetric matrices the SVD and eigendecompositions are quite different. Whereas many matrices may not diagonalize or may have complex eigenvalues, every matrix has an SVD with real, non-negative singular values.

The behaviour of the SVD is most easily explained in the relationship between two zero-mean random vectors x, y. First, the eigendecomposition of a covariance matrix

$$x \sim E[x x^T] = V \Lambda V^T \implies E[(V^T x)(V^T x)^T] = \Lambda \tag{A.83}$$

identifies the principal components (Section 8.2.1) of x: the set of mutually decorrelated random variables $y = V^T x$ from x such that

$$\begin{array}{l} v_i^T x \text{ is uncorrelated with } v_j^T x \;\; \forall j \neq i \\ v_i^T x \text{ has a variance of } \lambda_i. \end{array} \tag{A.84}$$

In contrast, the singular value decomposition of a cross-covariance

$$E[x y^T] = U S V^T \implies E[(U^T x)(V^T y)^T] = S \tag{A.85}$$

identifies the principal components of the relationship between x and y. In particular, given $x \in \mathbb{R}^n$, $y \in \mathbb{R}^k$, $k < n$, it follows that

$$\begin{array}{l} u_i^T x,\; i \leq k \text{ is uncorrelated with } v_j^T y \;\; \forall j \neq i \\ u_i^T x,\; i > k \text{ is uncorrelated with } v_j^T y \;\; \forall j \\ u_i^T x,\; i \leq k \text{ is correlated with } v_i^T y \text{ with strength or significance } \sigma_i. \end{array} \tag{A.86}$$



For a square matrix A, the SVD can be used to calculate $\kappa(A)$, the condition number of A, which measures the closeness of A to singularity:

$$\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}. \tag{A.87}$$

Thus for covariance matrices, which are symmetric, the equivalence (A.82) between the SVD and the eigendecomposition allows the condition number also to be evaluated via eigenvalues:

$$\kappa(A) = \frac{\max_i |\lambda_i(A)|}{\min_i |\lambda_i(A)|}. \tag{A.88}$$

This latter form may be useful in stationary cases where the FFT can be used to calculate eigenvalues, or for specific priors for which the eigendecomposition is known analytically.
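The two routes to the condition number, (A.87) and (A.88), and the relation $\sigma_i = |\lambda_i|$ for symmetric matrices can be illustrated as follows. This is a minimal sketch assuming NumPy; the two small matrices are illustrative assumptions, the second chosen to be indefinite so that a negative eigenvalue appears.

```python
import numpy as np

# Illustrative covariance-like matrix (symmetric positive-definite).
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])

U, s, Vt = np.linalg.svd(P)          # singular values returned in descending order
lam = np.linalg.eigvalsh(P)

kappa_svd = s[0] / s[-1]                                   # (A.87)
kappa_eig = np.max(np.abs(lam)) / np.min(np.abs(lam))      # (A.88)

# Symmetric but indefinite: singular values are the eigenvalue magnitudes.
S_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])
s_indef = np.linalg.svd(S_indef, compute_uv=False)
lam_indef = np.linalg.eigvalsh(S_indef)                    # eigenvalues -1 and 3
```

For nonsymmetric matrices only the SVD form (A.87) should be used, since the eigenvalue magnitudes no longer coincide with the singular values in general.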



A.7.3 Cholesky, Gauss, LU, Gram-Schmidt, QR, Schur 

The following ten matrix transformations are widely used and referred to. They are 
summarized here only very briefly; for a much more extensive discussion the reader 
is referred to the comprehensive text by Golub and Van Loan [141]. 

Cholesky Decomposition 

Given an $n \times n$ positive-definite symmetric matrix A,

The Cholesky decomposition finds a lower-triangular 1 matrix $\Gamma$ with positive diagonal elements, such that

$$A = \Gamma \Gamma^T. \tag{A.89}$$

1 Note: the default matrix returned by chol(A) in MATLAB is upper-triangular. To obtain the lower-triangular form, as in (A.89), use chol(A,'lower').

Clearly $\Gamma$ is the matrix square root (Appendix A.8) of A; also the triangularity of $\Gamma$ implies that $\Gamma$ is easy to invert, leading to an efficient solution for the matrix inverse $A^{-1}$.

The most attractive aspect of the Cholesky decomposition is that fast, numerically-stable algorithms exist to compute it (see Algorithm 4 in Chapter 10). As all matrix covariances are symmetric positive-definite, the Cholesky decomposition is widely used in statistical processing (particularly for covariance inversion and square roots, as in Appendix A.8).
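As a sketch of how the triangular factor is used in practice, the following assumes NumPy, whose `numpy.linalg.cholesky` returns the lower-triangular factor as in (A.89). The linear system Ax = b is solved by one forward and one backward substitution. The helper functions and the example matrix are illustrative assumptions, written out explicitly to show how triangularity is exploited.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_substitution(U, y):
    """Solve U x = y for upper-triangular U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# Illustrative symmetric positive-definite matrix and right-hand side (assumed).
A = np.array([[4.0, 2.0, 0.6],
              [2.0, 5.0, 1.0],
              [0.6, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])

G = np.linalg.cholesky(A)            # lower-triangular, A = G G^T
# A x = b  =>  G y = b, then G^T x = y
x = back_substitution(G.T, forward_substitution(G, b))
```

Each substitution costs on the order of n² operations, so once the factor is available, additional right-hand sides are cheap to solve.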



Gauss Transformation 



The Gauss transformation sets to zero all elements in a vector beyond some index i:

An $n \times n$ lower-triangular matrix $G_i$ is a Gauss transform if

$$G_i x = G_i \begin{bmatrix} x_1 \\ \vdots \\ x_i \\ x_{i+1} \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_i \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{A.90}$$

The effect of $G_i$ is to subtract multiples of row i from all following rows, specifically to subtract the multiple $x_k / x_i$ times row i from row k. The Gauss transformation is the elemental step in Gaussian elimination.



Gaussian Elimination 

A direct, non-iterative approach to the solving of linear systems and matrix inversions 
can be realized by the repeated application of Gauss transformations: 

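Given a linear system $Ax = b$, by applying Gauss transformations we zero the elements below the diagonal of A, such that in the resulting system $Ux = \tilde{b}$, with correspondingly transformed right-hand side $\tilde{b}$, the matrix U is upper triangular.

For example, given the linear system, written in equation or matrix form,

$$\begin{aligned} x_1 + 2x_2 &= 3 \\ x_1 + x_2 &= 5 \end{aligned} \qquad\qquad \begin{bmatrix} 1 & 2 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix} \tag{A.91}$$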
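then by subtracting row one from row two we have

$$\begin{aligned} x_1 + 2x_2 &= 3 \\ -x_2 &= 2 \end{aligned} \qquad\qquad \begin{bmatrix} 1 & 2 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \end{bmatrix} \tag{A.92}$$

To perform matrix inversion we apply Gaussian elimination to the linear system $AB = I$ where, given A, we seek the solution $B = A^{-1}$:

$$[A]\, B = [I] \;\xrightarrow{\text{Gauss. Elimination}}\; [\text{upper triangular}]\, B = [\,\cdot\,] \;\xrightarrow{\text{Backsubstitution}}\; [\text{diagonal}]\, B = [\,\cdot\,] \;\xrightarrow{\text{Normalization}}\; I\, B = A^{-1} \tag{A.93}$$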
The above presents only the basic, conceptual algorithm. There is a wide variety 
of practical considerations, such as row reordering (pivoting) for numerical stability 
and specialized algorithms for symmetric matrices. 
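The conceptual algorithm can be sketched as follows, applied to the example system of (A.91). The code assumes NumPy; the function name is hypothetical, and partial pivoting (row reordering by largest pivot magnitude) is included as one common choice among the practical considerations just mentioned, not as the method prescribed by the text.

```python
import numpy as np

def gaussian_elimination(A, b):
    """Solve A x = b by elimination with partial pivoting, then backsubstitution."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for i in range(n - 1):
        # Row reordering: bring the largest-magnitude pivot into row i.
        p = i + np.argmax(np.abs(A[i:, i]))
        if p != i:
            A[[i, p]] = A[[p, i]]
            b[[i, p]] = b[[p, i]]
        # Gauss transformation: zero column i below the diagonal.
        for k in range(i + 1, n):
            m = A[k, i] / A[i, i]
            A[k, i:] -= m * A[i, i:]
            b[k] -= m * b[i]
    # Backsubstitution on the upper-triangular system.
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

# The system of (A.91): x1 + 2 x2 = 3, x1 + x2 = 5.
x = gaussian_elimination(np.array([[1, 2], [1, 1]]), np.array([3, 5]))
```

For this system the elimination step reproduces (A.92), and backsubstitution gives $x_2 = -2$ and then $x_1 = 7$.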



LU Decomposition 

LU factors a matrix into a product of lower- and upper-triangular matrices:

The LU decomposition of a matrix A is

$$A = LU, \tag{A.94}$$

where L is lower-triangular and U is upper-triangular.

By definition, after applying Gaussian elimination to A we have an upper triangular product

$$\cdots G_3 (G_2 (G_1 A)) = U, \tag{A.95}$$

therefore

$$A = (G_1^{-1} G_2^{-1} G_3^{-1} \cdots)\, U = LU, \tag{A.96}$$

where the product $G_1^{-1} G_2^{-1} G_3^{-1} \cdots$ is lower-triangular because each of the Gaussian transformations $G_i$ is lower-triangular.

It is important to realize that not every matrix admits an LU decomposition. However, for matrices which admit such a decomposition, the solving of linear systems is extremely efficient:

$$Ax = b \implies L(Ux) = b \implies Ly = b,\; Ux = y \tag{A.97}$$

where the triangular form of L, U allows the final equations to be solved easily by backsubstitution.
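A minimal sketch of the factorization without row exchanges is given below, assuming NumPy. The function name `lu_nopivot` and the example matrix are hypothetical, and the code deliberately stops on a zero pivot, which is one way a matrix can fail to admit an LU decomposition in this simple form.

```python
import numpy as np

def lu_nopivot(A):
    """Factor A = L U by Gaussian elimination without pivoting (illustrative only)."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for i in range(n - 1):
        if U[i, i] == 0.0:
            raise ValueError("zero pivot encountered; this sketch does not pivot")
        for k in range(i + 1, n):
            L[k, i] = U[k, i] / U[i, i]      # multiplier stored in L
            U[k, i:] -= L[k, i] * U[i, i:]   # row operation applied to U
    return L, U

# Illustrative matrix and right-hand side (assumed).
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([1.0, 2.0, 3.0])

L, U = lu_nopivot(A)
# Two triangular solves as in (A.97). np.linalg.solve is used for brevity;
# a dedicated triangular solver would exploit the structure.
y = np.linalg.solve(L, b)
x = np.linalg.solve(U, y)
```

The multipliers stored in L are exactly the entries of the inverse Gauss transformations in (A.96), which is why no extra work is needed to form L.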
Gram-Schmidt Orthogonalization

Gram-Schmidt is the orthogonalization of a set of linearly independent vectors:

Given linearly independent $a_1, \ldots, a_n$, find vectors $b_1, \ldots, b_n$ such that

$$\operatorname{Span}(a_1, \ldots, a_n) = \operatorname{Span}(b_1, \ldots, b_n) \qquad b_i^T b_j = 0 \;\; \forall i \neq j. \tag{A.98}$$

Alternatively, stated in matrix form, given A with full column rank, find matrix B such that

$$\operatorname{Ra}(A) = \operatorname{Ra}(B) \qquad B^T B \text{ is diagonal} \tag{A.99}$$

The algorithm proceeds recursively, orthogonalizing a vector by removing components aligned with any previous vector:

$$\begin{aligned} b_1 &= a_1 \\ b_2 &= a_2 - \frac{b_1^T a_2}{b_1^T b_1}\, b_1 \\ b_3 &= a_3 - \frac{b_2^T a_3}{b_2^T b_2}\, b_2 - \frac{b_1^T a_3}{b_1^T b_1}\, b_1 \end{aligned} \tag{A.100}$$

Alternative, numerically-robust forms of this method exist [141].



Conjugate Gram-Schmidt Orthogonalization 

A modification of the preceding Gram-Schmidt procedure to conjugate-orthogonalize a set of linearly independent vectors:

Given matrix M and linearly independent $a_1, \ldots, a_n$, find $b_1, \ldots, b_n$ such that

$$\operatorname{Span}(a_1, \ldots, a_n) = \operatorname{Span}(b_1, \ldots, b_n) \qquad b_i^T M b_j = 0 \;\; \forall i \neq j. \tag{A.101}$$

The conjugate algorithm proceeds similarly to usual Gram-Schmidt, with vector conjugacy assessed in a reshaped space, modified by M:

$$b_1 = a_1 \tag{A.102}$$

$$b_2 = a_2 - \frac{b_1^T M a_2}{b_1^T M b_1}\, b_1 \tag{A.103}$$

$$b_3 = a_3 - \frac{b_2^T M a_3}{b_2^T M b_2}\, b_2 - \frac{b_1^T M a_3}{b_1^T M b_1}\, b_1 \tag{A.104}$$

Such matrix-conjugate vectors are of key interest in conjugate gradient and related Krylov methods (discussed in Section 9.2.3).
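Both the ordinary and the conjugate recursions can be written as one routine, since ordinary Gram-Schmidt is the special case M = I. The sketch below assumes NumPy; the function name and test matrices are hypothetical. It implements the classical recursion in the form given above, not one of the numerically robust variants.

```python
import numpy as np

def gram_schmidt(A, M=None):
    """Return B whose columns span the same space as those of A and satisfy
    b_i^T M b_j = 0 for i != j (plain orthogonality when M is None)."""
    n_rows, n_cols = A.shape
    M = np.eye(n_rows) if M is None else M
    B = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        a = A[:, j].astype(float)
        b = a.copy()
        for i in range(j):
            bi = B[:, i]
            # Remove the component of a_j that is (M-)aligned with b_i.
            b -= (bi @ M @ a) / (bi @ M @ bi) * bi
        B[:, j] = b
    return B

# Illustrative inputs (assumed): independent columns and a positive-definite M.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
M = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])

B_orth = gram_schmidt(A)         # B^T B is diagonal
B_conj = gram_schmidt(A, M)      # B^T M B is diagonal
```

The vectors are left unnormalized, matching the statement that $B^T B$ is diagonal, not the identity.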



Householder Transformation 

The Householder transformation reflects a vector (or matrix column) across a hyperplane to set to zero all but one element of the vector:

An $n \times n$ matrix

$$H = I - 2 \frac{v v^T}{v^T v} \qquad v \neq 0 \tag{A.105}$$

is called a Householder matrix or reflection, such that $Hx$ reflects x across $\operatorname{Span}(v)^{\perp}$, that is, across the hyperplane with normal vector v.

Suppose that we wish to set to zero all except the first element of a vector x (normally the column of a matrix). That is, we wish to find a transformation

$$Hx = \left( I - 2 \frac{v v^T}{v^T v} \right) x = \beta e_1 \tag{A.106}$$

such that $Hx$ is a multiple of the first unit vector. This is accomplished by setting

$$v = x \pm \|x\|\, e_1. \tag{A.107}$$

The appeal of the Householder reflection lies in the simple form of v.

Note that in applying this transformation to all of the columns of a matrix, the Householder matrix H is never explicitly computed, rather the transformation is calculated directly from v [141].
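The following minimal sketch, assuming NumPy, builds the reflection vector $v = x \pm \|x\| e_1$ and applies the reflection without ever forming H. The sign is chosen to match the sign of $x_1$, a common choice that avoids cancellation; this choice and the function names are assumptions for illustration.

```python
import numpy as np

def householder_vector(x):
    """v = x + sign(x_1) ||x|| e_1, so that H x is a multiple of e_1."""
    v = x.astype(float).copy()
    sign = 1.0 if x[0] >= 0 else -1.0
    v[0] += sign * np.linalg.norm(x)
    return v

def apply_householder(v, x):
    """Compute H x = x - 2 v (v^T x) / (v^T v) directly from v."""
    return x - 2.0 * v * (v @ x) / (v @ v)

x = np.array([3.0, 4.0, 0.0, 12.0])     # illustrative vector with ||x|| = 13
v = householder_vector(x)
Hx = apply_householder(v, x)            # all but the first element become zero
```

Applying the reflection this way costs on the order of n operations per vector, compared with n² for an explicit matrix-vector product with H.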
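Givens Rotation

A Givens rotation sets a single vector (or matrix column) element to zero:

An $n \times n$ Givens matrix $G(i, j, \theta)$ rotates the (i, j) vector elements by angle $\theta$, where $\theta$ is normally selected such that

$$y = G(i, j, \theta)\, x \implies y_j = 0. \tag{A.108}$$

A Givens matrix $G(i, j, \theta)$ is equal to the identity I, except for a 2 × 2 rotation matrix in the four elements indexed by i, j:

$$G(i, j, \theta) = \begin{bmatrix} I & & & & \\ & \cos(\theta) & & \sin(\theta) & \\ & & I & & \\ & -\sin(\theta) & & \cos(\theta) & \\ & & & & I \end{bmatrix} \begin{matrix} \\ \leftarrow \text{Row } i \\ \\ \leftarrow \text{Row } j \\ \\ \end{matrix} \tag{A.109}$$

where the cosine and sine entries lie in columns i and j. Clearly only elements i, j of a vector are affected in the product

$$y_k = \begin{cases} x_k & k \neq i, j \\ x_i \cos(\theta) + x_j \sin(\theta) & k = i \\ -x_i \sin(\theta) + x_j \cos(\theta) & k = j \end{cases} \tag{A.110}$$

We can set the jth element to zero by selecting

$$\cos(\theta) = \frac{x_i}{\sqrt{x_i^2 + x_j^2}} \qquad \sin(\theta) = \frac{x_j}{\sqrt{x_i^2 + x_j^2}} \tag{A.111}$$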
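A minimal sketch of a Givens rotation follows, assuming NumPy and the sign convention of the matrix in (A.109) applied as y = Gx, so that row i holds (cos, sin) and row j holds (−sin, cos). The function name, the zero-based indices, and the example vector are illustrative assumptions.

```python
import numpy as np

def givens(n, i, j, x):
    """n x n Givens matrix that zeroes element j of x by rotating it into element i."""
    r = np.hypot(x[i], x[j])
    # If both elements are already zero, no rotation is needed.
    c, s = (1.0, 0.0) if r == 0.0 else (x[i] / r, x[j] / r)
    G = np.eye(n)
    G[i, i], G[i, j] = c, s
    G[j, i], G[j, j] = -s, c
    return G

x = np.array([3.0, 1.0, 4.0])
G = givens(3, 0, 2, x)      # zero-based indices: rotate elements 0 and 2
y = G @ x                   # element 2 becomes zero, element 1 is untouched
```

Unlike a Householder reflection, which zeroes many elements at once, a Givens rotation targets a single element, which makes it attractive for sparse or structured matrices.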



QR Decomposition 

The QR decomposition expresses a matrix as the product of orthogonal and upper-triangular matrices:

The QR factorization of a $k \times n$ matrix A is given by

$$A = QR, \tag{A.112}$$

where Q is orthogonal and R is upper triangular.

The QR decomposition, a relatively complicated algorithm, can be accomplished using the Householder, Givens, or Gram-Schmidt transformations, and plays a central role in computing Schur decompositions.
Schur Decomposition

The Schur decomposition or factorization of a matrix is an alternative to the eigendecomposition:

Given a real, square matrix A, the Schur factorization of A is

$$A = Q U Q^T, \tag{A.113}$$

where Q is orthogonal and U is an upper-triangular matrix (possibly with diagonal blocks) with the eigenvalues of A appearing along the diagonal of U.

Given a complex, square matrix A, the Schur factorization of A is

$$A = Z U Z^H, \tag{A.114}$$

where Z is unitary and U is as before.

Clearly if U is a diagonal matrix, then the Schur form is equivalent to the regular eigendecomposition.



A.8 Matrix Square Roots 

In general, if a matrix P can be expressed as

$$ P = \Gamma^T \Gamma \tag{A.115} $$

then Γ is defined as a matrix square root of P. This expression is possible for all
positive-semidefinite P (that is, for all covariance matrices); however, the choice of
Γ is not unique. Indeed, any orthogonal transformation of Γ remains a square root:

$$ \bar{\Gamma} = U \Gamma \;\Rightarrow\; \bar{\Gamma}^T \bar{\Gamma} = \Gamma^T U^T U \Gamma = \Gamma^T \Gamma = P \tag{A.116} $$

where U is any orthogonal matrix, $U^T U = I$.

The matrix square root is most easily expressed in terms of its eigendecomposition:

$$ P = V \Lambda V^T = (V \Lambda^{1/2})(V \Lambda^{1/2})^T \;\Rightarrow\; \Gamma = (V \Lambda^{1/2})^T, \tag{A.117} $$

where Λ is a diagonal matrix of eigenvalues and $\Lambda^{1/2}$ is the diagonal matrix obtained by
taking the square root of each diagonal entry.

If P is symmetric, positive-definite, then the matrix square root can be computed 
much more efficiently using the Cholesky decomposition (Appendix A.7.3). 
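
The three routes to a square root described above (eigendecomposition, Cholesky, and an orthogonal transformation of an existing root) can be compared in a short sketch. The example matrix and the variable names (`gamma_eig`, `gamma_chol`, `gamma_rot`) are assumptions made for illustration:

```python
import numpy as np

# A small symmetric, positive-definite example matrix (illustrative values).
P = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Route 1: eigendecomposition P = V diag(lam) V^T, so gamma = (V diag(sqrt(lam)))^T.
lam, V = np.linalg.eigh(P)
gamma_eig = (V * np.sqrt(lam)).T          # V * sqrt(lam) scales the columns of V

# Route 2: Cholesky. NumPy returns lower-triangular L with P = L L^T, so gamma = L^T.
gamma_chol = np.linalg.cholesky(P).T

# Route 3: any orthogonal U applied to a square root gives another square root.
U, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((3, 3)))
gamma_rot = U @ gamma_eig

errors = [np.abs(g.T @ g - P).max() for g in (gamma_eig, gamma_chol, gamma_rot)]
```

All three factors reproduce P to rounding error even though the factors themselves differ, which is the non-uniqueness noted above.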

Square root matrices find three important uses: 

1. As the implicit representation of a positive-semidefinite matrix:

It can be numerically difficult to determine whether a given matrix is
positive-semidefinite, and similarly difficult to guarantee that a numerical computation
results in a positive-semidefinite matrix, although the consequences of inadvertently
losing positive-semidefiniteness can be striking (divergent algorithms, negative
error variances).

Instead, if a matrix P (usually a covariance, either a prior model or an estimation
error posterior) is represented by its square root

$$ P = \Gamma^T \Gamma, \tag{A.118} $$

then any manipulation $\Gamma \Rightarrow \tilde{\Gamma}$, possibly including numerical rounding, yields
a valid square root matrix $\tilde{\Gamma}$, where

$$ \tilde{P} = \tilde{\Gamma}^T \tilde{\Gamma} \ge 0 \tag{A.119} $$

is guaranteed to be positive-semidefinite.

2. As a numerically-robust representation:

The condition number κ(P) of a matrix P is a measure of conditioning, or numerical
sensitivity. Given the eigendecomposition

$$ P \underline{v}_i = \lambda_i \underline{v}_i \quad \text{or} \quad P V = V \Lambda, \tag{A.120} $$

if P is symmetric, positive-definite,² then

$$ \kappa(P) = \frac{\max_i \{\lambda_i\}}{\min_i \{\lambda_i\}} \qquad \kappa(\sqrt{P}) = \frac{\max_i \{\sqrt{\lambda_i}\}}{\min_i \{\sqrt{\lambda_i}\}} \tag{A.121} $$

therefore

$$ \kappa(\sqrt{P}) = \sqrt{\kappa(P)}, \tag{A.122} $$

$$ \log_{10} \kappa(\sqrt{P}) = \tfrac{1}{2} \log_{10} \kappa(P) \tag{A.123} $$

implying that the square root form requires only half the number of floating-point
digits for an adequate, implicit representation of P.

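3. In random sampling:

If a random vector x ~ (μ, P) has covariance P, then given the covariance square
root

$$ P = \Gamma^T \Gamma \tag{A.124} $$

we can find a random sample of x as

$$ \underline{x} = \underline{\mu} + \Gamma^T \underline{w} \qquad \underline{w} \sim (\underline{0}, I). \tag{A.125} $$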
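
The second and third uses listed above lend themselves to a quick numerical check. The sketch below is an illustration under assumed values: the mean, the covariance, the sample count and the choice of a Cholesky factor as the square root are all assumptions, and `gamma` stands for the square-root factor.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])                 # assumed mean
P = np.array([[2.0, 0.8],
              [0.8, 1.0]])                 # assumed covariance
gamma = np.linalg.cholesky(P).T            # one valid square root: P = gamma^T gamma

# Random sampling: white, unit-variance samples w are coloured by gamma^T.
w = rng.standard_normal((2, 100_000))
x = mu[:, None] + gamma.T @ w              # each column is one sample of x
sample_mean = x.mean(axis=1)
sample_cov = np.cov(x)                     # rows are variables, columns are samples

# Conditioning: the square root has the square root of the condition number of P.
cond_P = np.linalg.cond(P)
cond_gamma = np.linalg.cond(gamma)
```

The sample covariance approaches P as the number of samples grows, and `cond_gamma` equals the square root of `cond_P`, consistent with the halving of the required digits.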



2 Implying that eigenvalues and singular values are equal, allowing the discussion to be sim- 
plified. 


A.9 Pseudoinverses

Rectangular matrices do not, by definition, possess an inverse. However, it is possible
to define pairs of rectangular matrices which satisfy certain aspects of inversion.

In general, given a matrix A we might refer to its pseudoinverse as that matrix $A^+$
such that the product is as close as possible to the identity [141]:

$$ A^+ = \arg\min_X \|AX - I\|_F = \arg\min_X \sum_{i,j} (AX - I)_{i,j}^2, \tag{A.126} $$

a definition which is too vague and impractical, in general.

The most common definition of a pseudoinverse asserts the Moore-Penrose
conditions [4, 141]:

$$ A^+ A = (A^+ A)^T, \quad A A^+ = (A A^+)^T, \quad A A^+ A = A, \quad A^+ A A^+ = A^+ \tag{A.127} $$

implying that the repeated application of a pseudoinverse pair $(A A^+)$ or $(A^+ A)$
leaves a product unchanged.

If A has full rank, often encountered in inverse problems, then the pseudoinverse is
computable in closed form:

$$ \begin{aligned}
\text{Full Row Rank} &\Rightarrow \mathrm{Nu}(A^T) = \{0\} \Rightarrow A^+ = A^T (A A^T)^{-1} \Rightarrow A A^+ = I \\
\text{Full Column Rank} &\Rightarrow \mathrm{Nu}(A) = \{0\} \Rightarrow A^+ = (A^T A)^{-1} A^T \Rightarrow A^+ A = I.
\end{aligned} \tag{A.128} $$

The above definitions do lead to the sensible conclusion that the pseudoinverse
matrix for an invertible matrix is, in fact, just the regular matrix inverse:

$$ A \text{ invertible} \Rightarrow (A^T A)^{-1} A^T = A^T (A A^T)^{-1} = A^{-1}. \tag{A.129} $$

Pseudoinverses are used in reductions of dimensionality, such as in Section 8.2.2, and
also in the solution to the least-squares problems of Chapter 3. In particular, suppose
that

$$ \underline{m} = C \underline{z} + \underline{v} \qquad E[\underline{v}] = \underline{0}, \quad \mathrm{cov}(\underline{v}) = I \tag{A.130} $$

where C is any real matrix. Then the least-squares minimum-norm solution for z is
given by [4]

$$ \hat{\underline{z}} = C^+ \underline{m}, \tag{A.131} $$

valid whether C has full row or full column rank. That is, for a full-rank k × n matrix C:

$$ \begin{array}{llll}
k = n & \Rightarrow \text{Nonsingular} & \Rightarrow \hat{\underline{z}} = C^+ \underline{m} = C^{-1} \underline{m} & \text{Unique} \\
k < n & \Rightarrow \text{Underdetermined} & \Rightarrow \hat{\underline{z}} = C^+ \underline{m} = C^T (C C^T)^{-1} \underline{m} & \text{Min. Norm} \\
k > n & \Rightarrow \text{Overdetermined} & \Rightarrow \hat{\underline{z}} = C^+ \underline{m} = (C^T C)^{-1} C^T \underline{m} & \text{Least Squares}
\end{array} \tag{A.132} $$
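
The closed-form full-rank pseudoinverses can be compared against a general-purpose routine. This is a sketch under assumptions: the two example matrices and measurement vectors are made up for illustration, and `numpy.linalg.pinv` and `numpy.linalg.lstsq` are NumPy's own routines rather than anything defined in the text.

```python
import numpy as np

# Underdetermined case (k < n): full row rank.
C_wide = np.array([[1.0, 0.0, 2.0, 1.0],
                   [0.0, 1.0, 1.0, 3.0]])
# Overdetermined case (k > n): full column rank.
C_tall = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0]])

pinv_wide = C_wide.T @ np.linalg.inv(C_wide @ C_wide.T)   # C^T (C C^T)^-1
pinv_tall = np.linalg.inv(C_tall.T @ C_tall) @ C_tall.T   # (C^T C)^-1 C^T

m_wide = np.array([1.0, 2.0])
m_tall = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z_min_norm = pinv_wide @ m_wide        # exact solution of smallest norm
z_least_sq = pinv_tall @ m_tall        # minimizes ||C z - m||
```

In practice explicit inverses are usually avoided in favour of a solver or an SVD-based routine; the explicit forms are used here only to mirror the formulas.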



B 

Statistics 



This appendix provides a brief summary of univariate and multivariate statistics,
covariances, and simple transformations of random variables. For a more detailed
review the reader is referred to [37, 76, 99, 248, 284].



B.1 Random Variables, Random Vectors, and Random Fields



This section provides an overview of the quantities fundamental to this text, starting 
from random scalars, to random vectors, and then to random multidimensional fields. 



B.1.1 Random Variables



A random variable is a single scalar which is random. Random variables are typically 
either discrete (a random integer, for example) or continuous (a random real number). 

The nature of a random variable is characterized by its associated cumulative
distribution function

$$ F_x(t) = \Pr(x \le t). \tag{B.1} $$

Typically more convenient for continuous random variables is the probability density
function (PDF) $p_x(\cdot)$, which satisfies

$$ F_x(t) = \Pr(x \le t) = \int_{-\infty}^{t} p_x(s)\, ds. \tag{B.2} $$

Only on occasion, when it is necessary to distinguish between a random variable and 
its particular instance, do we explicitly specify the PDF subscript; normally it will 
be understood from the context. 

The expectation operation is defined as

$$ E[f(x)] = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx, \tag{B.3} $$

where f is some mathematical function. Two very important choices of f lead to the
definitions of mean and variance:

Mean: $\mu_x = E[x]$

Variance: $\sigma_x^2 = E[x^2] - E[x]^2 = E[(x - E[x])^2]$.

The square root of the variance, $\sigma_x$, is referred to as the standard deviation of x.

Given N independent samples $x_i$ of random variable x, we can estimate the sample
statistics

Sample Mean: $\hat{\mu}_x = \frac{1}{N} \sum_{i=1}^{N} x_i$,

Sample Variance: $\hat{\sigma}_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu}_x)^2$.

By far the most important and common PDF is the Normal or Gaussian distribution:

$$ p(x) = \frac{1}{\sqrt{2\pi}\, \sigma_x} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu_x}{\sigma_x} \right)^2 \right\}. \tag{B.4} $$
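
The sample statistics can be illustrated by drawing Gaussian samples with a known mean and standard deviation and checking that the estimates recover them. The particular values, the sample count, and the use of the 1/(N-1) normalization for the sample variance are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_x, sigma_x = 2.0, 0.5                       # assumed true mean and standard deviation
x = rng.normal(mu_x, sigma_x, size=50_000)     # independent Gaussian samples

N = x.size
mean_hat = x.sum() / N                              # sample mean
var_hat = ((x - mean_hat) ** 2).sum() / (N - 1)     # sample variance, 1/(N-1) form
std_hat = np.sqrt(var_hat)                          # sample standard deviation
```

With this many samples the estimates agree with the true values to roughly two decimal places; the error in the sample mean shrinks in proportion to one over the square root of N.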



B.1.2 Joint Statistics 

A joint probability distribution p(x, y) characterizes the relationship of two random
variables x and y:

$$ \Pr(x \le \alpha,\, y \le \beta) = \int_{-\infty}^{\alpha} \int_{-\infty}^{\beta} p(x, y)\, dy\, dx. \tag{B.5} $$

A marginal distribution is derived from a joint one by integrating out one or more
variables; for example

$$ p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy. \tag{B.6} $$

Expectations for multiple variables are defined similarly to (B.3):

$$ E[f(x, y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, p(x, y)\, dx\, dy. \tag{B.7} $$

We say that x and y are independent if

$$ p(x, y) = p(x)\, p(y), \tag{B.8} $$

in which case knowing x tells us nothing about y. We say that x and y are
uncorrelated if

$$ E[xy] = E[x]\, E[y]. \tag{B.9} $$

If x and y are uncorrelated then they are not linearly related in any way, but may
still be related in some nonlinear fashion. Independence implies uncorrelatedness,
but not the other way around.
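
A small discrete example makes the last point concrete. The choice of x uniform on {-1, 0, 1} with y = x² is an assumption made for illustration: y is completely determined by x, yet the two are uncorrelated. Exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction
from itertools import product

xs = [-1, 0, 1]
p_x = {x: Fraction(1, 3) for x in xs}        # x is uniform on {-1, 0, 1}
joint = {(x, x * x): p_x[x] for x in xs}     # y = x^2 is fully determined by x

# Expectations computed directly from the joint distribution.
E_x = sum(p * x for (x, y), p in joint.items())
E_y = sum(p * y for (x, y), p in joint.items())
E_xy = sum(p * x * y for (x, y), p in joint.items())

# Marginal distribution of y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p

uncorrelated = (E_xy == E_x * E_y)
independent = all(joint.get((x, y), 0) == p_x[x] * p_y[y]
                  for x, y in product(p_x, p_y))
```

Here `uncorrelated` is true while `independent` is false: the relationship between x and y is purely nonlinear, so it is invisible to the second-order statistic E[xy].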

The correlation between x and y is defined as

$$ E[(x - \mu_x)(y - \mu_y)], \tag{B.10} $$

which reduces to the simple E[xy] if x and y have a mean of zero. Frequently more
useful is a normalized version of the correlation, the correlation coefficient between
x and y:

$$ \rho_{x,y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y}. \tag{B.11} $$

The correlation coefficient measures the ability to predict y as a linear function of x:
$\rho_{x,y} = 0$ implies no predictability (x and y uncorrelated), and $\rho_{x,y} = \pm 1$ implies
perfect predictability (x and y deterministically linearly related). It is always true
that $|\rho_{x,y}| \le 1$.

We can generalize the above by introducing conditional statistics:

p(x|y) is the PDF for x conditioned on another random variable y,
p(x|A) is the PDF for x given that the event A took place.

Joint, marginal, and conditional densities are related by Bayes' rule:

$$ p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(y|x)\, p(x)}{p(y)}, \tag{B.12} $$

which applies to continuous and/or discrete random variables, thus

$$ \Pr(x|y)\, p(y) = p(y|x)\, \Pr(x) \tag{B.13} $$

for discrete x and continuous y.

Finally we can also define conditional expectations, consistent with our previous
definition:

$$ E[f(x)|y] = \int f(x)\, p(x|y)\, dx. \tag{B.14} $$

Note that we're integrating only over x, not over y; in (B.14) y is just a given piece of
information, not a random variable. The variable of integration may be emphasized
by writing the expectation as

$$ E_x[f(x)|y]. \tag{B.15} $$

B.1.3 Random Vectors

The extension of random variables to random vectors is straightforward: a random
vector x of dimension n is a column vector of n random variables:

$$ \underline{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}. \tag{B.16} $$

The PDF of x is the joint density function of its components:

$$ \Pr(x_1 \le r_1, \ldots, x_n \le r_n) = \int_{-\infty}^{r_1} \cdots \int_{-\infty}^{r_n} p(\underline{x})\, d\underline{x}. \tag{B.17} $$

Note, however, that although x is now a vector, p(x) is still a scalar!

Computing the joint density of a subset of the components of x is accomplished by
computing the appropriate marginal distribution; for example

$$ p(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \int_{-\infty}^{\infty} p(\underline{x})\, dx_i. \tag{B.18} $$

Expectations of functions of random vectors are computed as before:

$$ E[f(\underline{x})] = \int \cdots \int f(\underline{x})\, p(\underline{x})\, d\underline{x} \tag{B.19} $$

which leads to the vectorized definitions

Mean: $\underline{\mu}_x = E[\underline{x}]$

Covariance: $\Sigma_x = E[(\underline{x} - \underline{\mu}_x)(\underline{x} - \underline{\mu}_x)^T]$

and the corresponding definitions for the sample statistics

Sample Mean: $\hat{\underline{\mu}}_x = \frac{1}{N} \sum_{i=1}^{N} \underline{x}_i$

Sample Covariance: $\hat{\Sigma}_x = \frac{1}{N-1} \sum_{i=1}^{N} (\underline{x}_i - \hat{\underline{\mu}}_x)(\underline{x}_i - \hat{\underline{\mu}}_x)^T$.

The covariance (also see Appendix B.4) is an n by n matrix, with an important
structure: the (i, j)th entry in the matrix

$$ (\Sigma_x)_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)] \tag{B.20} $$

is the correlation between $x_i$ and $x_j$. Thus zero entries in $\Sigma_x$ imply a decorrelation
of the corresponding two variables.



We can also talk about the relationship between random vectors: 
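
Cross-Covariance: $\Sigma_{xy} = E[(\underline{x} - \underline{\mu}_x)(\underline{y} - \underline{\mu}_y)^T]$

Uncorrelated: $E[\underline{x}\, \underline{y}^T] = E[\underline{x}]\, E[\underline{y}]^T \Rightarrow \Sigma_{xy} = 0$

Independent: $p(\underline{x}, \underline{y}) = p(\underline{x})\, p(\underline{y})$

As before, independence implies uncorrelatedness, but not conversely.
Finally, Bayes' rule applies:

$$ p(\underline{x}|\underline{y}) = \frac{p(\underline{x}, \underline{y})}{p(\underline{y})} = \frac{p(\underline{y}|\underline{x})\, p(\underline{x})}{p(\underline{y})}. \tag{B.21} $$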



B.1.4 Random Fields 

A random field [2, 62, 112, 335] X is a collection of random variables arranged on a
lattice Ω:

$$ X = \{ x_i \in \Psi \mid i \in \Omega \}. \tag{B.22} $$

In principle the lattice can be any (possibly irregular) collection of discrete points in
any number of dimensions; however it is most convenient and intuitive to visualize
the lattice as a rectangular, regular array of sites:

$$ \Omega = \{ (i, j) \mid 1 \le i \le n_1,\; 1 \le j \le n_2 \} \tag{B.23} $$

in which case a random field is just a set of random pixels

$$ X = \{ x_{i,j} \mid (i, j) \in \Omega \}. \tag{B.24} $$

A random field is (spatially) stationary if its statistics are only a function of offset,
and not of position, for example that

$$ E[x_{i,j}\, x_{i+\delta,\, j+\kappa}] = E[x_{0,0}\, x_{\delta,\kappa}] \quad \forall\, i, j. \tag{B.25} $$

Similarly, a time-dynamic random field

$$ X(t) = \{ x(t)_{i,j} \mid (i, j) \in \Omega \} \tag{B.26} $$

is temporally stationary if the statistics are only a function of temporal offset, and
not of absolute time:

$$ E[x(t)_{i,j}\, x(s)_{a,b}] = E[x(0)_{i,j}\, x(s - t)_{a,b}]. \tag{B.27} $$

We therefore have the separate, distinct concepts of time stationarity and spatial sta- 
tionarity. It is possible for X(t) to have neither, one, or both forms of stationarity. It 
is similarly possible for a multidimensional field X to be stationary in one direction 
(e.g., along the columns) but not in another (the rows). 

As with random variables or random vectors, any random field can, in principle, be
completely characterized by its associated probability measure $p_X(X)$. The detailed
form of p(·) depends on whether the alphabet Ψ of the elements $x_{i,j} \in \Psi$ is discrete,
in which case $p_X(X)$ denotes a probability distribution, or continuous, in which case
$p_X(X)$ denotes a probability density function.

Because any random field X can be lexicographically reordered into a column vector
$\underline{x} = [X]_:$, all of the properties of random vectors and their associated covariances
extend to random fields.

In most cases, the distinct feature of random fields is their size. Consider, for
example, a modestly sized image, in which $n_1 = n_2 = 256$. In this case, X contains
65 536 random elements, and the joint distribution p(·) or the covariance $\mathrm{cov}([X]_:)$
must explicitly characterize the joint statistics of 65 536 elements; the covariance
matrix alone has 65 536² ≈ 4.3 × 10⁹ entries.

Because the function p(·) is a cumbersome and computationally inefficient means of
defining the statistics of a random field, a great part of the research into random fields
involves the discovery or definition of implicit statistical forms which lead to
effective or faithful representations of the true statistics, while admitting computationally
efficient algorithms. Chapter 5 examines this question of representation at length.



B.2 Transformation of Random Vectors 

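It is common to operate on a random vector with some function

$$ \underline{y} = \underline{f}(\underline{x}), \tag{B.28} $$

or, written in component form,

$$ y_i = f_i(x_1, \ldots, x_n). \tag{B.29} $$

If f() is a linear function then

$$ \underline{y} = \underline{f}(\underline{x}) = A \underline{x} + \underline{b} \tag{B.30} $$

for some constant matrix A and vector b; in this linear case the statistics of the
transformation can be calculated in closed form:

$$ \begin{aligned}
\underline{\mu}_y &= E[\underline{y}] \\
&= E[A \underline{x} + \underline{b}] = A E[\underline{x}] + \underline{b} = A \underline{\mu}_x + \underline{b}
\end{aligned} \tag{B.31} $$

$$ \begin{aligned}
\Sigma_y &= E[(\underline{y} - \underline{\mu}_y)(\underline{y} - \underline{\mu}_y)^T] \\
&= E[(A \underline{x} + \underline{b} - A \underline{\mu}_x - \underline{b})(A \underline{x} + \underline{b} - A \underline{\mu}_x - \underline{b})^T] \\
&= E[A (\underline{x} - \underline{\mu}_x)(\underline{x} - \underline{\mu}_x)^T A^T] \\
&= A\, E[(\underline{x} - \underline{\mu}_x)(\underline{x} - \underline{\mu}_x)^T]\, A^T \\
&= A \Sigma_x A^T.
\end{aligned} \tag{B.32} $$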
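
The closed-form mean and covariance of a linearly transformed random vector can be checked by simulation. Everything specific in this sketch (the mean, covariance, A, b, the Gaussian sampling distribution and the sample count) is an assumption chosen for illustration; the closed-form result itself does not depend on the distribution being Gaussian.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_x = np.array([1.0, 0.0, -1.0])
sigma_x = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
A = np.array([[1.0,  2.0, 0.0],
              [0.0, -1.0, 1.0]])
b = np.array([0.5, -0.5])

# Closed-form statistics of y = A x + b.
mu_y = A @ mu_x + b
sigma_y = A @ sigma_x @ A.T

# Monte Carlo check: one sample of x per row, transformed row by row.
x = rng.multivariate_normal(mu_x, sigma_x, size=200_000)
y = x @ A.T + b
mu_y_hat = y.mean(axis=0)
sigma_y_hat = np.cov(y, rowvar=False)      # sample covariance, columns are variables
```

The sample statistics `mu_y_hat` and `sigma_y_hat` should agree with `mu_y` and `sigma_y` up to sampling error.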

The shape of the PDF is generally distorted, even by linear transformations, except
in one special case: linear transformations map normal distributions to normal
distributions:

$$ \underline{x} \sim N(\underline{\mu}, \Sigma) \;\xrightarrow{\;A \underline{x} + \underline{b}\;}\; \underline{y} \sim N(A \underline{\mu} + \underline{b},\; A \Sigma A^T). \tag{B.33} $$

In the case where the transformation is nonlinear the computation of the probability
density is more difficult. Suppose we have random variables x, y, related as

$$ y = f(x) \tag{B.34} $$

where f is continuous. The PDF of y is

$$ p_y(Y) = \lim_{\delta \to 0} \frac{\Pr(|Y - y| < \delta)}{2\delta}. \tag{B.35} $$

That is, we're interested in determining the probability of finding y in some small
window about Y. But y will be near Y only if x is near a root $X_i$ of $f(X_i) - Y = 0$.
Because f is continuous (and assumed differentiable at each root), for small δ

$$ f(X_i + \delta) \simeq Y + \delta f'(X_i), \tag{B.36} $$

then given the q roots $X_1, \ldots, X_q$

$$ \Pr(|Y - y| < \delta) = \sum_{i=1}^{q} \Pr\left( |X_i - x| < \delta / |f'(X_i)| \right), \tag{B.37} $$

where the number q of roots will generally be a function of Y. Finally, taking the
limit as δ → 0, we find

$$ p_y(Y) = \frac{p_x(X_1)}{|f'(X_1)|} + \cdots + \frac{p_x(X_q)}{|f'(X_q)|} \tag{B.38} $$

as the nonlinear transformation from x to y.
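
As a worked instance of this root-summing result, consider the assumed example y = x² with x a zero-mean, unit-variance Gaussian. For Y > 0 there are two roots, ±√Y, and f'(x) = 2x. The sketch below compares the sum over roots (using the magnitude of the derivative) with a direct small-window estimate of the density of y; the function names and the window width are illustrative choices.

```python
import math

def p_x(x):
    """Zero-mean, unit-variance Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def p_y(Y):
    """Density of y = x^2 at Y > 0: sum over both roots of p_x(X_i) / |f'(X_i)|."""
    roots = (math.sqrt(Y), -math.sqrt(Y))
    return sum(p_x(X) / abs(2.0 * X) for X in roots)

def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def window_estimate(Y, delta=1e-5):
    """Pr(|y - Y| < delta) / (2 delta), computed exactly from the Gaussian CDF."""
    hi, lo = math.sqrt(Y + delta), math.sqrt(Y - delta)
    prob = 2.0 * (gaussian_cdf(hi) - gaussian_cdf(lo))   # both signs of x contribute
    return prob / (2.0 * delta)

comparison = [(Y, p_y(Y), window_estimate(Y)) for Y in (0.5, 1.0, 2.0)]
```

The two columns agree closely, and `p_y` matches the known chi-squared density with one degree of freedom, exp(-Y/2) / sqrt(2πY).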



B.3 Multivariate Gaussian Distribution 



The common assumption of the Gaussian distribution is motivated on a variety of
counts:

• Common in the physical world due to the Central Limit Theorem,

• Distribution preservation under linear transformations,

• The equivalence of independence and uncorrelatedness,

• The equivalence of the MAP and Bayesian least-squares estimators,

• The linearity of the optimum Bayesian least-squares estimator.
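
The definition of the multivariate Gaussian is straightforward:

$$ p(\underline{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\underline{x} - \underline{\mu})^T \Sigma^{-1} (\underline{x} - \underline{\mu}) \right\}. \tag{B.39} $$

We briefly consider two special cases.

Case 1: The dimension n = 1:

Setting n = 1 causes (B.39) to reduce to

$$ p(x) = \frac{1}{(2\pi)^{1/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^2\, \Sigma^{-1} \right\} \tag{B.40} $$

$$ \phantom{p(x)} = \frac{1}{(2\pi)^{1/2} \sigma} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\} \tag{B.41} $$

where $\Sigma = \sigma^2$. That is, the multivariate Gaussian (B.39) properly reduces to the
usual univariate Gaussian (B.4).

Case 2: The covariance is diagonal:

Therefore the covariance matrix has the form

$$ \Sigma = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix}. \tag{B.42} $$

The diagonality of Σ implies a simple form for the matrix inverse and determinant,
thus

$$ p(\underline{x}) = \frac{1}{(2\pi)^{n/2} \prod_{i=1}^{n} \sigma_i} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right\} \tag{B.43} $$

$$ \phantom{p(\underline{x})} = \prod_{i=1}^{n} \frac{1}{(2\pi)^{1/2} \sigma_i} \exp\left\{ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right\} \tag{B.44} $$

$$ \phantom{p(\underline{x})} = \prod_{i=1}^{n} p(x_i). \tag{B.45} $$

That is, if x is a multivariate Gaussian, and if the elements of x are uncorrelated,
then the elements of x are also independent.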

[Figure B.1: panels "Surface of p(x)" and "Contours of p(x)"]

Fig. B.1. A two-dimensional normal distribution is characterized by its two-by-two covariance
matrix, left. By slicing through the distribution, middle, the contour of constant probability is
seen to be an ellipse. The constant-probability contours are most easily seen in a contour plot,
right, where the thick line shows the unit standard deviation contour.

A covariance matrix

$$ \Sigma_x = E[(\underline{x} - \underline{\mu})(\underline{x} - \underline{\mu})^T] \tag{B.46} $$

describes the second-order interrelationships of the elements of a random vector x.
A single element of the covariance

$$ (\Sigma_x)_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)] \tag{B.47} $$

is the correlation between $x_i$ and $x_j$. Thus zero entries in $\Sigma_x$ imply a decorrelation
of the corresponding two variables, and a diagonal covariance implies that all of the
variables are mutually decorrelated.

All covariances must satisfy four properties:

$\Sigma = \Sigma^T$        Symmetry,

$\Sigma_{i,i} \ge 0$       Non-negativity of diagonal elements,

$\Sigma \ge 0$             Positive-semidefiniteness (Appendix A.4), and

$\lambda_i \ge 0$          Non-negativity of eigenvalues (Appendix A.7.1).

The cross-covariance

$$ \Sigma_{xy} = E[(\underline{x} - \underline{\mu}_x)(\underline{y} - \underline{\mu}_y)^T] \tag{B.48} $$

also allows us to describe the relationship between random vectors; however,
cross-covariances do not obey any of the symmetry, non-negativity, or positive-definiteness
properties of covariances. It is true, though, that $\Sigma_{xy} = \Sigma_{yx}^T$ and that a zero entry
in the ith column and jth row of $\Sigma_{xy}$ implies a decorrelation between $y_i$ and $x_j$.

Fig. B.2. A covariance describes an n-dimensional ellipsoid, with the eigenvectors pointing in
the direction of the principal axes, and the eigenvalues describing the axis lengths.

Returning to the multivariate normal distribution of (B.39), the equiprobability
contour (the collection of points in space, all of which are equally likely) is defined by

$$ p(\underline{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\underline{x} - \underline{\mu})^T \Sigma^{-1} (\underline{x} - \underline{\mu}) \right\} = \text{constant}, \tag{B.49} $$

that is, that

$$ (\underline{x} - \underline{\mu})^T \Sigma^{-1} (\underline{x} - \underline{\mu}) = \text{constant}. \tag{B.50} $$

The set of points described by (B.50) forms an ellipse, as seen in Figure B.1. In
general the behaviour of a covariance is sketched by drawing the unit standard deviation
contour, where

$$ (\underline{x} - \underline{\mu})^T \Sigma^{-1} (\underline{x} - \underline{\mu}) = 1. \tag{B.51} $$

This unit-ellipse is centred on the mean μ, has axes pointing in the directions of the
eigenvectors of Σ, and has semi-axis lengths equal to the square roots of the
eigenvalues of Σ. The relationships between a covariance and its associated
eigendecomposition are summarized in Figure B.2. The matrix properties of covariances are further
discussed in the context of eigendecompositions in Appendix A.7.1, specifically on
page 399. Certain properties of cross-covariances are discussed in the context of the
singular value decomposition in Appendix A.7.2.

In general, for the bivariate case a covariance is written as

$$ \Sigma = \begin{bmatrix} a & b \\ b & c \end{bmatrix} \qquad a, c > 0 \quad |b| < \sqrt{ac}, \tag{B.52} $$

where the inequality constraint on b ensures that the covariance remains
positive-definite. The unit standard deviation ellipse associated with Σ is perfectly inscribed
in a rectangle, centred on the mean, of width $2\sqrt{a}$ and height $2\sqrt{c}$, where the angle
and eccentricity of the ellipse are controlled by the sign and magnitude of b. Four
examples are plotted in Figure B.3.

Fig. B.3. Four examples of two-dimensional Normal distributions. In each case, the darkly
banded contour shows the unit standard deviation distance, and the dashed rectangle plots the
bounding box.
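
The geometry of the unit standard deviation ellipse can be verified numerically. In this sketch the values of a, b, c and the mean are assumptions; the ellipse is traced by mapping the unit circle through the eigenvectors scaled by the square roots of the eigenvalues.

```python
import numpy as np

a, b, c = 4.0, 1.2, 1.0                    # assumed values with |b| < sqrt(a c)
sigma = np.array([[a, b],
                  [b, c]])
mu = np.array([1.0, 2.0])

lam, V = np.linalg.eigh(sigma)             # eigenvalues (ascending) and eigenvectors
semi_axes = np.sqrt(lam)                   # semi-axis lengths of the unit ellipse

# Trace the ellipse: scale the unit circle by the semi-axes, rotate by V, shift by mu.
t = np.linspace(0.0, 2.0 * np.pi, 20001)
unit_circle = np.vstack([np.cos(t), np.sin(t)])
ellipse = mu[:, None] + V @ (semi_axes[:, None] * unit_circle)

# Every traced point should satisfy (x - mu)^T sigma^-1 (x - mu) = 1.
d = ellipse - mu[:, None]
mahalanobis_sq = np.einsum("ik,ij,jk->k", d, np.linalg.inv(sigma), d)

half_width = np.abs(d[0]).max()            # should approach sqrt(a)
half_height = np.abs(d[1]).max()           # should approach sqrt(c)
```

The bounding box depends only on the diagonal entries a and c; changing b tilts and narrows the ellipse inside the same rectangle.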



Image Processing 



Although this text is not about image processing per se, some familiarity with com- 
mon concepts in image processing, such as convolution or denoising, is very helpful. 
This appendix is only a brief list of concepts, and is in no way a comprehensive tu- 
torial on image processing, for which the interested reader is referred to any one of 
many excellent textbooks [36, 54, 143, 174, 210]. 

An image I is a set of values arranged on a rectangular grid. An image may be
considered a two-dimensional function

I(x, y)

or as a matrix

$I_{i,j}$

with the frustrating disadvantage that the spatial indexing of the two notational
schemes is quite different, as illustrated in Figure C.1.

Any image stored in a computer is an approximate representation of a real-world 
phenomenon: 

Real-World Image: x, y, I(x,y) are all continuous-valued 

Computer-Stored Image: x,y,I(x,y) are all discretized. 

As this text concerns the computer processing of multidimensional data, we focus on 
the latter definition. 

For most images, I(x, y) is either a scalar, in which case the image is referred to as
"grey scale," or I(x, y) is a vector, in which case we have a colour image in some
colour space. By far the most common colour space is RGB,

$$ I(x, y) = \begin{bmatrix} R(x, y) \\ G(x, y) \\ B(x, y) \end{bmatrix} \tag{C.1} $$

such that each image pixel consists of red, green, and blue values. Many other colour
spaces have been defined (HSV, YIQ, . . . ), but are not relevant to this text. Instead,
in this text the value of an image I(x, y) can be any unknown or measured quantity,
such as the temperature of the ocean, the strength of a radar return, or the rate of
signal decay in an MRI.

[Figure C.1: panels "Usual Cartesian Coordinate System I(x, y)" and "Usual Matrix Indexing System $I_{i,j}$"]

Fig. C.1. Two coordinate systems are commonly used in image processing.



C.1 Convolution



In signal processing, any linear, time-invariant operation on a signal can be written in
terms of what is known as a convolution [244]

$$ s(t) * h(t) = \int_{-\infty}^{\infty} s(\tau)\, h(t - \tau)\, d\tau \qquad s(n) * h(n) = \sum_{\tau = -\infty}^{\infty} s(\tau)\, h(n - \tau), \tag{C.2} $$

where both the continuous-time and discrete-time versions are given, respectively. Here h
is known as the impulse response, and characterizes the linear operation.

Precisely the same is true in two (and higher) dimensions, such that

$$ \begin{aligned}
\bar{I}(x, y) = I(x, y) * H(x, y) &= \sum_{m = -\infty}^{\infty} \sum_{n = -\infty}^{\infty} I(m, n)\, H(x - m, y - n) \\
&= \sum_{m = -\infty}^{\infty} \sum_{n = -\infty}^{\infty} I(x - m, y - n)\, H(m, n),
\end{aligned} \tag{C.3} $$

where H, typically known as a convolution kernel, describes a pattern or a mask such
that a single element in the result $\bar{I}$ is formed as a weighted sum of elements in I,
where the kernel H consists of a matrix of weights.
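
The two-dimensional sum in (C.3) translates almost directly into code. The following is a minimal sketch, assuming finite arrays that are treated as zero outside their support; the function name `conv2d_full` and the example image and kernel are illustrative, not from the text. It uses the second form of the sum: each kernel weight H(m, n) contributes a shifted, scaled copy of the image.

```python
import numpy as np

def conv2d_full(image, kernel):
    """Direct two-dimensional convolution of two finite arrays.

    Both arrays are treated as zero outside their support, so the result
    has shape (M + P - 1, N + Q - 1) for an M x N image and P x Q kernel.
    """
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    M, N = image.shape
    P, Q = kernel.shape
    out = np.zeros((M + P - 1, N + Q - 1))
    for m in range(P):
        for n in range(Q):
            # H(m, n) weights a copy of the image shifted by (m, n).
            out[m:m + M, n:n + N] += kernel[m, n] * image
    return out

image = np.arange(12.0).reshape(3, 4)
impulse = np.array([[1.0]])                 # identity kernel: leaves the image unchanged
blur = np.ones((3, 3)) / 9.0                # simple averaging (blurring) kernel

unchanged = conv2d_full(image, impulse)
blurred = conv2d_full(image, blur)
```

Because the blur weights sum to one, the total of all pixel values is preserved in the full-size result; the loop over kernel entries rather than image pixels keeps the cost proportional to the kernel size times the image size.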

Consider three illustrations:

Image Blurring:

In an optical system $\mathcal{H}$, say consisting of one or more lenses, the
point-spread-function H is the image that results from a point (impulsive) light source

$$ H = \mathcal{H}(\delta(x, y)). $$

If the system is linear, such that superposition applies,

$$ \mathcal{H}(\alpha I_1 + \beta I_2) = \alpha \mathcal{H}(I_1) + \beta \mathcal{H}(I_2), \tag{C.4} $$

and stationary, meaning that a shift in the input leads to a corresponding shift in
the output,

$$ \bar{I}(x, y) = \mathcal{H}(I(x, y)) \;\Rightarrow\; \bar{I}(x - \Delta_x, y - \Delta_y) = \mathcal{H}(I(x - \Delta_x, y - \Delta_y)), \tag{C.5} $$

then the system can be written as a convolution, with the point-spread-function
(the "impulse response") as the convolution kernel:

$$ \bar{I} = \mathcal{H}(I) = H * I. \tag{C.6} $$

Two examples of blurring are shown in Figure C.2. There is no particular need for
H to be isotropic, as is illustrated by the anisotropic example in the figure.

Edge Detection:

If an edge is defined as an abrupt change in image brightness, then a differencing
operator can be used to reveal edges.

Among the many edge detectors which have been proposed, two of the simplest are
the Sobel operators $S_x$, $S_y$ shown in the bottom panel of Figure C.2. Many other
variations are possible, including diagonal differences (rather than just horizontal
and vertical), and spatially-adaptive thresholds on the gradient.

Image Filtering: 

In general, a convolution is just the filtering of an image. The blur and edge il- 
lustrations in Figure C.2 are examples of low-pass and high-pass filtering, respec- 
tively. 

Methods of filter design have been developed (see any of the image processing 
textbooks cited at the beginning of this appendix) which allow standard one- 
dimensional filters, such as notch-pass or band-pass filters, to be generalized to 
two dimensions. 

The convolution equations in (C.2) are very elegant in theory, but become consider- 
ably more complicated in practice because images are finite: they have boundaries. 

Fig. C.2. Convolution is a powerful concept in image processing. An impulsive kernel is the
identity operator, left, causing no change to the image. Other convolution kernels may cause 
blurring, top, or detect edges, bottom. 



We have a few different options, illustrated in Figure C.3: 



1. Compute the convolution only within the image, away from the boundary. This
approach corresponds to the "Valid Region" in Figure C.3.

2. Just truncate the convolution sum outside of the image, which implicitly assumes
the image to be zero outside of its boundaries. This approach corresponds to
keeping the dark boundary of I * H in Figure C.3.

3. Assume the image to be periodic, and therefore nonzero outside of its boundary,
leading to what is known as circular convolution. If the image is not, in fact,
periodic then spill-over effects may be visible, as can be seen in the top and
bottom of I ⊛ H in Figure C.3.

4. Mirror the image at its boundaries, rather than assume it to be periodic. This
approach turns out to have a higher computational complexity than the simpler
periodic assumption, but tends to lead to better results for real images.

[Figure C.3: panels "Original Image I", "Convolved Image I * H", "Periodic Image I",
"Circular Convolved Image I ⊛ H", "Mirrored Image I", "Convolved Mirrored Image"]

Fig. C.3. The behaviour of convolution is somewhat subtle at image boundaries. For regular
convolution over a finite image, top, the image is implicitly treated as zero outside of its
domain, leading to the boundary effect seen in I * H. The convolution is unaffected by the
boundary only within the "Valid" region. Under circular convolution, middle, the image is
implicitly treated as periodic, which can lead to boundary effects if the periodic assumption
is a poor one (observe the darkening near the top of the circular convolution, due to
wrap-around from the dark bottom). Mirroring the image at its boundaries frequently leads to better
boundary behaviour.
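
The zero, periodic, and mirrored boundary assumptions can be compared in a short sketch by padding the image before convolving. The helper name `filter_same`, the odd-sized kernel requirement, and the constant test image are assumptions; the padding itself uses NumPy's `numpy.pad` with its `"constant"`, `"wrap"`, and `"symmetric"` modes.

```python
import numpy as np

def filter_same(image, kernel, mode):
    """Convolve with a centred, odd-sized kernel, returning an image-sized result.

    mode selects the boundary assumption via numpy.pad:
    "constant" (zero outside), "wrap" (periodic), or "symmetric" (mirrored).
    """
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    P, Q = kernel.shape                      # assumed odd in both dimensions
    p, q = P // 2, Q // 2
    padded = np.pad(image, ((p, p), (q, q)), mode=mode)
    M, N = image.shape
    flipped = kernel[::-1, ::-1]             # convolution flips the kernel
    out = np.zeros((M, N))
    for m in range(P):
        for n in range(Q):
            out += flipped[m, n] * padded[m:m + M, n:n + N]
    return out

flat = np.ones((5, 5))                       # constant image: blurring should not change it
box = np.ones((3, 3)) / 9.0

zero_pad = filter_same(flat, box, "constant")
periodic = filter_same(flat, box, "wrap")
mirrored = filter_same(flat, box, "symmetric")
```

For the constant image, the periodic and mirrored results stay flat, whereas zero padding darkens the border: only four of the nine kernel weights fall inside the image at a corner. All three agree in the interior "valid" region.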



C.2 Image Transforms 

Images can be large, highly complex, with foregrounds and backgrounds and struc- 
tures on a variety of scales. In order to be able to analyze the information content of 
an image more simply, many transformations have been proposed for one purpose or 
another. The transformations may be colour or greyscale, linear or nonlinear, local 
or global, flat or hierarchical, a change of basis or a dimensionality reduction. 

Three particularly common transforms are shown in Figure C.4: 

The Fourier Transform [244] represents an image as a weighted sum of si- 
nusoids, analogous to the one-dimensional Fourier transform. The transform is 
global and assumes the image to be periodic, limiting the usefulness and appli- 
cability of this transformation when processing real images. However, the exis- 
tence of a very fast algorithm, the Fast Fourier Transform, makes the transform of 
great interest for certain problems in statistical image processing (see Sections 8.3 
and 11.2.1). 

The Fourier transform has a very close connection to circular convolution and 
linear systems. For any image I and impulse response H, 

$$ \mathrm{FFT}(I \circledast H) = \mathrm{FFT}(I) \odot \mathrm{FFT}(H) \tag{C.7} $$

That is, circular convolution corresponds to element-by-element multiplication in 
the Fourier domain. 

The Wavelet Transform [293] is a local, hierarchical transform, far more ef- 
fective than the Fourier transform in processing real images. The image is rep- 
resented as a weighted sum of shifted and rescaled versions of a single function 
(the wavelet). In many cases the wavelet transform is very sparse (most coeffi- 
cients are near zero), making the approach very successful in image compression 
and denoising. The wavelet transform can be used as a preconditioner for spatial 
problems, as discussed in Section 8.4.2. 

The Hough Transform [54] searches an image for lines. Whereas the Fourier
and wavelet transforms are invertible, meaning that the image can be recovered
from the transform coefficients, the Hough transform is a method of image
analysis, not one of representation. For every straight line, parametrized by its slope and
intercept, the Hough transform sums the image along that line, leading to peaks in
the transform corresponding to detected lines in the image.

Clearly the Hough transform can be generalized to other parametrized shapes,
such as circles or parabolae.

Fig. C.4. A great many image transformations have been proposed. Observe particularly the
sparsity of the wavelet transform, and the diagonal band in the Fourier transform stemming
from the camera motion tracking the bicyclist in the original image. The Hough transform was
applied to the Vertical-Scale 1 image from the wavelet transform.
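
Returning to the Fourier transform, the correspondence (C.7) between circular convolution and element-by-element multiplication in the Fourier domain can be confirmed numerically. In this sketch the random test image, the smoothing kernel, and the helper name `circular_conv2d` are assumptions; the transforms use NumPy's `numpy.fft.fft2` and `numpy.fft.ifft2`.

```python
import numpy as np

def circular_conv2d(image, kernel):
    """Direct circular convolution; the kernel is zero-padded to the image size.

    Returns the convolved image and the zero-padded kernel.
    """
    M, N = image.shape
    H = np.zeros((M, N))
    H[:kernel.shape[0], :kernel.shape[1]] = kernel
    out = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            if H[m, n] != 0.0:
                # np.roll shifts the image periodically by (m, n).
                out += H[m, n] * np.roll(image, shift=(m, n), axis=(0, 1))
    return out, H

rng = np.random.default_rng(5)
image = rng.standard_normal((8, 8))
kernel = np.array([[1.0, 2.0, 1.0],
                   [2.0, 4.0, 2.0],
                   [1.0, 2.0, 1.0]]) / 16.0

direct, H = circular_conv2d(image, kernel)
via_fft = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(H)))
```

The two results agree to rounding error. The direct sum costs on the order of the image size times the kernel size, whereas the FFT route costs on the order of n log n for an n-pixel image regardless of kernel size, which is the source of its appeal for large kernels.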



Other important transformations, described in image processing texts, include Gabor 
filters, Laplacian pyramids, difference of Gaussians, and the Cosine transform. 

C.3 Image Operations

Many algorithms and methods have been developed for image processing. A very
short, incomplete list follows:

Denoising: Given an image corrupted by noise, estimate a noise-reduced image.
The noise may be additive or multiplicative, Gaussian or non-Gaussian, white or
correlated (Figure C.5 and Figure 2.3).

In-Painting: A generalization of the denoising problem: given an image with
missing pieces, develop a way to extrapolate the observed behaviour of the image into
the missing parts (Figure 2.3).

Deblurring or Deconvolution: Given an image corrupted by some
point-spread-function, invert the point-spread convolution to obtain the original image.
The point-spread-function may be known; if it is unknown, the problem is referred
to as blind deconvolution (Figure 2.3).

Resolution Enhancement or Superresolution: Given one or more images
at some resolution, estimate the image at a higher resolution (Application 11 and
Section 8.4.3).

Edge Detection: Find the edges or lines in an image, often an initial operation
to simplify image analysis for further image segmentation, feature detection, or
object recognition (Figure C.6).

Classification: Classify each image pixel into one of k possible predetermined
behaviours. A great many satellite remote-sensing problems fall into this
category, distinguishing urban and rural, ice and water, forest and agricultural etc.
(Section 7.1.3 and Application 6).

Segmentation: Essentially the blind version of image classification, dividing an
image into non-overlapping regions of homogeneous behaviour, but where the
number of regions and their meaning are unknown ahead of time (Figure C.7 and
Application 7).

Compression: Find a transformation whereby an image can be represented by, and
subsequently reconstructed from, as few coefficients as possible (Section 8.2.1).

Feature Detection: Extract significant points of interest from an image, often
corners. Preferably the features are robust, invariant to rotation, scale, or changes
in illumination.

Object Recognition: Similar to classification, produce a map which identifies
objects of interest in an image (car, person, . . . ) as distinct from the background.

Watermarking: Via subtle changes in certain image features, place into the
image a digital signature or watermark, in a way that is robust to image rotation,
rescaling, cropping etc.

Tracking and Registration: Associate the objects or features in one image with
those in one or more other images.

Fig. C.5. A small sampling of the many denoising methods which have been developed. A
regular convolution leads to excessive smoothing, however the Wiener filter is least-squares
optimal for Gaussian noise, and the nonlinear median filter performs very well for the
non-Gaussian salt-and-pepper noise.

Of these operations, three have a somewhat greater relevance to this text: 

1. Image denoising is an inverse problem, as discussed in Chapter 2. Almost all 
noise-reduction methods proceed on the basis of assuming that neighbouring pix- 
els in an image are closely related and can be averaged, such as using a simple 
convolutional blur, as shown in Figure C.5. 



Convolution is indiscriminate, however, blurring all parts of an image equally. 
More refined approaches are adaptive, blurring more broadly in smooth regions, 
and more locally in regions having greater detail. Nonlinear methods, such as a 
median filter, attempt to preserve edge structure by not averaging across an edge. 

Some of the more powerful denoising approaches (see Problem 8.5) involve non- 
linear processing in the sparse wavelet domain. 
Fig. C.6. The Sobel edge detector is based on the simple convolutions in Figure C.2. The more 
sophisticated Canny detector is among the most widely used approaches. 



2. The simple Sobel convolutional operators demonstrated in Figure C.2 are really 
a very primitive approach to edge detection. A well-established, more reliable 
approach to edge detection is the Canny detector, shown in Figure C.6. An alter- 
native approach is the Zero-Cross method, based on finding zero-crossings in the 
second derivative of an image. 

3. Image segmentation seeks to divide an image into homogeneous regions. Essen- 
tially, image segmentation is a dual to edge detection, in the sense that the outlines 
of the segmented regions offer one possible edge map for an image. The resulting 
segmentation is a hidden label map (per Chapter 7). 

The simplest approach to segmentation is a global, single threshold ξ, such that 
the image is divided as shown in Figure C.7. Clearly non-global variations can be 
proposed, such that the threshold ξ(x,y) varies with location. There are a great 
many approaches to segmentation, including clustering methods based on K- 
means [91], region growing / region merging methods, active contours (snakes), 
level-set methods, graph-based methods, scale-space methods, and watershed. 
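
As a minimal illustration of the denoising discussion in item 1 above, the following Python sketch compares a 3x3 averaging convolution with a 3x3 median filter on a tiny synthetic image containing a step edge and a single salt-type outlier. The function names, the test image, and the replicated-border handling are illustrative assumptions and do not come from the text.

```python
from statistics import median


def filter3x3(image, reducer):
    """Apply `reducer` to every 3x3 neighbourhood, replicating border pixels."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            out[r][c] = reducer(window)
    return out


def mean(values):
    """Uniform average, i.e. a 3x3 box convolution with weights 1/9."""
    return sum(values) / len(values)


# Step edge (10 | 50) with one "salt" outlier at row 2, column 1.
image = [[10, 10, 10, 50, 50, 50] for _ in range(5)]
image[2][1] = 255

blurred = filter3x3(image, mean)      # linear: averages across the edge
medianed = filter3x3(image, median)   # nonlinear: rejects the outlier
```

In this toy case the median filter removes the outlier and leaves the step edge intact, whereas the averaging filter spreads the outlier into its neighbours and smooths the edge.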




[Figure C.7 panels: Original Image; Otsu Global Thresholding; Watershed Segmentation] 



Fig. C.7. An image can be segmented into pieces by binarization with a single global thresh- 
old, such as using Otsu's method. Most complex images do not threshold well with a single 
threshold, so local methods, such as watershed, are commonly used. 
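
The global thresholding of Figure C.7 can be sketched as follows. This is a minimal pure-Python version of Otsu's method, which selects the threshold that maximizes the between-class variance of the two resulting pixel classes. The function name, the assumption of integer grey levels in 0..255, and the small test image are illustrative assumptions rather than anything specified in the text.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing the between-class variance.

    Pixels with value <= t form the lower class, pixels > t the upper class.
    Assumes integer grey levels in the range 0 .. levels-1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # pixel count of the lower class
    sum0 = 0    # intensity sum of the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant 1/total^2 dropped).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical 3x4 image: a dark region (about 20) and a bright region (about 200).
image = [
    [20, 22, 21, 200],
    [19, 23, 205, 198],
    [21, 201, 199, 202],
]
threshold = otsu_threshold([p for row in image for p in row])
labels = [[1 if p > threshold else 0 for p in row] for row in image]
```

A location-dependent threshold could be obtained in the same spirit by applying the function to local windows rather than to the whole image, at the cost of choosing a window size.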



Reference Summary 



Long bibliographic lists can be difficult to use; this section therefore attempts to 
provide some thematic structure to assist the reader in finding meaningful references 
and further reading. 

Because textbooks and journal papers fulfill rather different purposes in terms of 
depth, breadth, accessibility, and on-line availability, the textbook and paper refer- 
ences are listed separately. 
Inverse Problems: 

15, 30, 32, 36, 94, 95, 107, 119, 159-161, 167, 178, 185, 201, 207, 209, 251, 253, 
256, 267, 288, 296, 298-300, 303, 305, 307, 349 

Medical Imaging: 

36, 51, 183, 218, 220, 319 

Other Applications: 

113, 148, 208, 236, 291, 332 



References 



1. J. Abbott, M. Bronstein, T. Mulders, "Fast deterministic computation of determinants of 
dense matrices," Int. Conf. Symbolic and Algebraic Computation, Vancouver, pp. 197- 
204, 1999 

2. R. Adler, The Geometry of Random Fields, Wiley, 1981 

3. H. Akaike, "Markovian representation of stochastic processes by canonical variables," 
SIAM J. Control (13) #1, pp.162-173, 1975 

4. A. Albert, Regression and the Moore-Penrose Pseudoinverse, Academic Press, 1972 

5. S. Alexander, P. Fieguth, M. Ioannidis, E. Vrscay, "Hierarchical annealing for synthesis 
of binary porous media images," Mathematical Geosciences (41) #4, pp.357-378, 2009 

6. T. Anderson, The Statistical Analysis of Time Series, Wiley, 1971 

7. D. Angwin, H. Kaufman, Digital Image Restoration, Springer, 1991 

8. B. Anderson, J. Moore, Optimal Filtering, Prentice-Hall, 1979 

9. G. Arce, "Multistage Order Statistic filters for image sequence processing," IEEE Trans. 
Acoustics, Speech, Signal Processing (39) #5, pp.1147-1163, 1991 

10. A. Asif, J. Moura, "Data assimilation in large time varying multidimensional fields," 
IEEE Trans. Image Processing (8) #11, pp.1593-1607, 1999 

11. A. Asif, J. F. Moura, "Block matrices with L-Block banded inverse: inversion algo- 
rithms," IEEE Trans. Signal Processing (53) #2, pp.630-642, 2005 

12. R. Aster, B. Borchers, C. Thurber, Parameter Estimation and Inverse Problems, Aca- 
demic Press, 2005 

13. Z. Azimifar, P. Fieguth, E. Jernigan, "Towards random field modeling of wavelet statis- 
tics," ICIP'02, Rochester, 2002 

14. S. Baker, T. Kanade, "Limits on super-resolution and how to break them," IEEE CVPR, 
2000 

15. Ballard, Hinton, Sejnowski, "Parallel computation in vision problems," Nature (306) 
#5938, pp.21-26, 1983 

16. F. Barbaresco, S. Bonney, J. Lambert, B. Monnier, "Motion-based segmentation and 
tracking of dynamic radar clutter," ICIP'96 (III), pp.923-926, 1996 

17. Y. Bar Shalom, T. Fortmann, Tracking and Data Association, Academic Press, 1988 

18. M. Basseville, A. Benveniste, K. Chou, S. Golden, R. Nikoukhah, A. Willsky, "Model- 
ing and estimations of multiresolution stochastic processes," IEEE Trans. Information 
Theory (38) #2, pp.766-784, 1992 




19. L. Baum, T. Petrie, G. Soules, N. Weiss, "A maximization technique occurring in the 
statistical analysis of probabilistic functions of Markov chains," The Annals of Mathe- 
matical Statistics (41) #1, pp.164-171, 1970 

20. M. Bello, A. Willsky, B. Levy, "Construction and applications of discrete-time smooth- 
ing error models," Int. J. Control (50) #1, pp.203-223, 1989 

21. D. Benboudjema, W. Pieczynski, "Unsupervised statistical segmentation of nonstation- 
ary images using triplet Markov fields," IEEE Trans. PAMI (29) #8, pp. 1367-1378, 2007 

22. C. Benedek, T. Sziranyi, Z. Kato, J. Zerubia, "A multi-layer MRF model for object- 
motion detection in unregistered airborne image-pairs," IEEE ICIP (VI), pp.141-144, 
2007 

23. J. Beran, Statistics for Long-Memory Processes, Chapman & Hall, 1994 

24. M. Bertero, P. Boccacci, Introduction to Inverse Problems in Imaging, Taylor & Francis, 
1998 

25. J. Besag, "Spatial interaction and the statistical analysis of lattice systems," J. Royal 
Statistical Society, Series B (36), pp.192-236, 1974 

26. J. Besag, "On the statistical analysis of dirty pictures," J. Royal Statistical Society B 
(48) #3, pp.256-302, 1986 

27. M. Bertero, T. Poggio, V. Torre, "Ill-posed problems in early vision," Proc. IEEE (76) 
#8, pp.869-889, 1988 

28. G. Bierman, Factorization Methods for Discrete Sequential Estimation, Academic Press, 
1977 

29. C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995 

30. A. Blake, A. Yuille (Eds.), Active Vision, MIT Press, 1993 

31. A. Blake, R. Curwen, A. Zisserman, "A framework for spatiotemporal control in the 
tracking of visual contours," Int. J. Computer Vision (11) #2, pp. 127-145, 1993 

32. A. Blake, Active Contours, Springer, 1998 

33. S. Blostein, T. Huang, "Detecting small, moving objects in image sequences using se- 
quential hypothesis testing," IEEE Signal Processing (39) #7, pp.1611-29, 1991 

34. L. Blum, F. Cucker, M. Shub, S. Smale, Complexity and Real Computation, Springer, 
1997 

35. C. A. Bouman and M. Shapiro, "A multiscale random field model for Bayesian image 
segmentation," IEEE Trans. Image Processing, (3) #2, pp. 162-177, March 1994 

36. A. Bovik (Ed.), Handbook of Image and Video Processing, 2nd ed., Academic Press, 
2005 

37. G. Box, G. Jenkins, G. Reinsel, Time Series Analysis - Forecasting and Control, 
Prentice-Hall, 1994 

38. A. Brandt, "Multi-level adaptive solutions to boundary-value problems," Mathematics 
of Computation (31) #138, pp.333-390, 1977 

39. J. Bramble, J. Pasciak, A. Schatz, "The construction of preconditioners for elliptic prob- 
lems by substructuring III," Mathematics of Computation (51) #184, pp.415-430, 1988 

40. J. Bramble, Multigrid Methods, Wiley, 1993 

41. J. Brailean, R. Kleihorst, S. Efstratiadis, A. Katsaggelos, A. Lagendijk, "Noise reduction 
filters for dynamic image sequences: A review," Proc. IEEE (83) #9, pp. 1272-1292, 
1995 

42. P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, 
Springer, 1999 

43. W. Briggs, A Multigrid Tutorial, SIAM, 1987 

44. W. Briggs, "Wavelets and multigrid," SIAM J. Scientific Computing (14), 1993 

45. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, 1966 




46. M. Brookes, The Matrix Reference Manual, Imperial College, 2005 

47. R. Bucy, P. Joseph, Filtering for Stochastic Processes, Wiley, 1968 

48. C. Byrnes, S. Gusev, A. Lindquist, "A convex optimization approach to the rational 
covariance extension problem," SIAM J. Control and Optimization (37) #1, pp.211-229, 
1999 

49. S. Campbell, C. Meyer, Generalized Inverses of Linear Transformations, Dover, 1991 

50. W. Campaigne, P. Fieguth, S. Alexander, "Frozen-state hierarchical annealing," 
ICIAR'06 (Springer LNCS 4141), 2006 

51. E. Candes, J. Romberg, T. Tao, "Robust uncertainty principles: Exact signal reconstruc- 
tion from highly incomplete frequency information," IEEE Trans. Information Theory 
(52) #2, pp.489-509, 2006 

52. O. Cappe, E. Moulines, T. Ryden, Inference in Hidden Markov Models, Springer, 2005 

53. G. Carballo, P. Fieguth, "Multiresolution network flow phase unwrapping," IEEE Trans. 
Geoscience and Remote Sensing (40) #8, pp. 1695-1708, 2002 

54. K. Castleman, Digital Image Processing, Prentice-Hall, 1996 

55. D. Chandler, Introduction to Modern Statistical Mechanics, Oxford University Press, 
1987 

56. S. Chang, B. Yu, M. Vetterli, "Adaptive wavelet thresholding for image denoising and 
compression," IEEE Trans. Image Processing (9) #9, pp. 1532-1546, 2000 

57. S. Chang, B. Yu, M. Vetterli, "Spatially adaptive wavelet thresholding with context 
modeling for image denoising," IEEE Trans. Image Processing (9) #9, pp.1522-1531, 
2000 

58. P. Charbonnier, L. Blanc-Feraud, M. Barlaud, "Noisy image restoration using multires- 
olution Markov random fields," J. Visual Communication and Image Representation (3) 
#4, pp.338-346, 1992 

59. R. Chellappa, R. Kashyap, "Digital image restoration using spatial interaction models," 
IEEE Trans. Acoustics, Speech, Signal Processing (30) #3, pp.461-472, 1982 

60. R. Chellappa and S. Chatterjee, "Classification of textures using Gaussian Markov ran- 
dom fields," IEEE Trans. Acoustics, Speech, Signal Processing, (33), pp.959-963, 1985 

61. R. Chellappa, "Two-dimensional discrete Gaussian Markov random field models for 
image processing," Progress in Pattern Recognition (2), pp.79-112, 1985 

62. R. Chellappa, A. Jain (Eds.), Markov Random Fields - Theory and Application, Aca- 
demic Press, 1993 

63. K. Chen, Matrix Preconditioning Techniques and Applications, Cambridge University 
Press, 2005 

64. M. Chen, Q. Shao, J. Ibrahim, Monte Carlo Methods in Bayesian Computation, Springer 
Series in Statistics, Springer, 2000 

65. T. Chin, W. C. Karl, A. Willsky, "Sequential filtering for multi-frame visual reconstruc- 
tion," Signal Processing (28), pp.311-333, 1992 

66. T. M. Chin, W. C. Karl, A. S. Willsky, "A distributed and iterative method for square 
root filtering in space-time estimation," Automatica (31) #1, pp.67-82, 1995 

67. T. M. Chin, A. J. Mariano, and E. P. Chassignet, "Spatial regression and multiscale ap- 
proximations for sequential data assimilation in ocean models," J. Geophysical Research 
(104), pp.7991-8014, 1999 

68. K. Chou, A Stochastic Modeling Approach to Multiscale Signal Processing, PhD Thesis, 
Dept. EECS, Massachusetts Institute of Technology, 1991 

69. K. Chou, A. Willsky, A. Benveniste, "Multiscale recursive estimation, data fusion, and 
regularization," IEEE Trans. Automatic Control (39) #3, pp.464-478, 1994 

70. S. Clippingdale, R. Wilson, "Least-squares image estimation on a multiresolution pyra- 
mid," ICASSP'89, Glasgow, pp.1409-1412, 1989 




71. J. Coleman, Gaussian Spacetime Models: Markov Field Properties, PhD Thesis, Uni- 
versity of California at Davis, 1995 

72. M. Costantini, "A novel phase unwrapping method based on network programming," 
IEEE Trans. Geoscience and Remote Sensing (36) #3, pp.813-821, 1998 

73. R. Courant, D. Hilbert, Methods of Mathematical Physics VI, Interscience Publishers, 
1953 

74. I. Cox, S. Hingorani, "An efficient implementation of Reid's multiple hypothesis track- 
ing algorithm and its evaluation for the purpose of visual tracking," IEEE Trans. PAMI 
(18) #2, pp.138-150, 1996 

75. N. Cressie, "The origins of kriging," Math. Geol. (22) #3, pp.239-252, 1990 

76. N. Cressie, Statistics for Spatial Data, Wiley, 1993 

77. M. Crouse, R. Nowak, R. Baraniuk, "Wavelet-based statistical signal processing using 
hidden Markov models," IEEE Trans. Signal Processing (46) #4, pp.886-902, 1998 

78. Dahlquist, Bjorck, Numerical Methods, Prentice-Hall, 1974 

79. W. Dahmen, A. Kunoth, "Multilevel preconditioning," Numerische Mathematik (63) #3, 
pp.315-344, 1992 

80. R. Daley, Atmospheric Data Analysis, Cambridge University Press, 1991 

81. M. Daniel, A. Willsky, "A multiresolution methodology for signal-level fusion and data 
assimilation with applications to remote sensing," Proc. IEEE, (85), pp. 164-180, 1997 

82. M. Daniel, A. Willsky, "The modeling and estimation of statistically self-similar pro- 
cesses in a multiresolution framework," IEEE Trans. Information Theory (45) #3, 
pp.955-970, 1999 

83. P. Davis, Circulant Matrices, Wiley-Interscience, 1979 

84. J. Davis, A. Bobick, "The representation and recognition of human movement using 
temporal templates," CVPR, pp.928-934, Puerto Rico, 1997 

85. A. Dempster, N. Laird, D. Rubin, "Maximum likelihood estimation from incomplete 
data," J. Royal Statistical Society (B) (39) #1, pp.1-38, 1977 

86. H. Derin, P. Kelly, "Discrete-index Markov-type random processes," Proc. IEEE (77) 
#10, pp.1485-1510, 1989 

87. D. Donoho, I. Johnstone, "Adapting to unknown smoothness via wavelet shrinkage," J. 
Am. Statistical Association (90), pp.1200-1224, 1995 

88. A. Doucet, N. de Freitas, N. Gordon (Eds.), Sequential Monte Carlo Methods in Prac- 
tice, Springer, 2001 

89. J. Driscoll, D. Healy, "Computing Fourier transforms and convolutions on the 2-Sphere," 
Adv. in Appl. Math. (15), pp.202-250, 1994 

90. I. Drori, D. Cohen-Or, H. Yeshurun, "Fragment-based image completion," ACM Trans. 
Graphics (22) #3, pp.303-312, 2003 

91. R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, 2001 

92. D. Dudgeon, R. Mersereau, Multidimensional Digital Signal Processing, Prentice-Hall, 
1984 

93. B. Efron, G. Gong, "A leisurely look at the bootstrap, the jackknife, and cross-validation," 
The American Statistician, 1983 

94. A. Efros, T. Leung, "Texture synthesis by non-parametric sampling," IEEE ICCV, 1999 

95. A. Eleftheriadis, A. Jacquin, "Automatic face location detection and tracking for model- 
assisted coding of video teleconference sequences at low bit rates," Signal Processing - 
Image Communication (7) #3, pp.231-248, 1995 

96. R. Elliott, L. Aggoun, J. Moore, Hidden Markov Models: Estimation and Control, 3rd 
ed., Springer, 2008 

97. R. Eubank, A Kalman Filter Primer, CRC Press, 2006 




98. D. Evans, Preconditioning Methods: Analysis and Application, Gordon & Breach, 1983 

99. Evans, Hastings, and Peacock, Statistical Distributions, Institute of Physics Publishers, 
2001 

100. G. Evensen, "Sequential data assimilation with nonlinear quasi-geostrophic model us- 
ing Monte Carlo methods to forecast error statistics," J. Geophysical Research 99 (C5), 
pp. 143-162, 1994 

101. G. Evensen, Data Assimilation: The Ensemble Kalman Filter, Springer, 2007 

102. E. Fabre, "New fast smoothers for multiscale systems," IEEE Trans. Signal Processing 
(44) #8, pp.1893-1911, 1996 

103. B. Farrell, P. Ioannou, "State estimation using a reduced-order Kalman filter," J. Atmo- 
spheric Sciences (58), pp.3666-3680, 2001 

104. S. Farsiu, M. Robinson, M. Elad, P. Milanfar, "Fast and robust multiframe super resolu- 
tion," IEEE Trans. Image Processing (13) #10, pp. 1327-1344, 2004 

105. P. Fieguth, W. Karl, A. Willsky, C. Wunsch, "Multiresolution optimal interpolation and 
statistical analysis of TOPEX/POSEIDON satellite altimetry," IEEE Trans. Geoscience 
and Remote Sensing (33) #2, pp.280-292, 1995 

106. P. W. Fieguth, A. S. Willsky, "Fractal estimation using models on multiscale trees," IEEE 
Trans. Signal Processing (44) #5, pp. 1297-1300, 1996 

107. P. Fieguth, D. Terzopoulos, "Color-based tracking of heads and other mobile ob- 
jects at video frame rates," CVPR'97, pp.21-28, Puerto Rico, 1997 

108. P. Fieguth, W. Karl, A. Willsky, "Efficient multiresolution counterparts to variational 
methods for surface reconstruction," Computer Vision & Image Understanding (70) #2, 
pp. 157-176, 1998 

109. P. Fieguth, D. Menemenlis, T. Ho, A. Willsky, C. Wunsch, "Mapping Mediterranean 
altimeter data with a multiresolution optimal interpolation algorithm," J. Atmospheric 
and Oceanic Technology (15), pp.535-546, 1998 

110. P. Fieguth, "Multiply-rooted multiscale models for large-scale estimation," IEEE Trans. 
Image Processing (10) #11, pp.1676-1686, 2001 

111. P. Fieguth, D. Menemenlis, I. Fukumori, "Mapping and pseudo-inverse algorithms for 
ocean data assimilation," IEEE Trans. Geoscience and Remote Sensing (41) #1, pp. 43- 
51, 2003 

112. P. Fieguth, J. Zhang, "Random field models" in A. Bovik (Ed.), Handbook of Image and 
Video Processing, Academic Press, pp.361-376, 2005 

113. W. Fieguth, Multi-Input Quasi-Linearization, Ph.D. Thesis, Dept. of Electrical Engi- 
neering, University of New Brunswick, 1967 

114. M. Figueiredo, R. Nowak, "Wavelet-based image estimation: An empirical Bayes ap- 
proach using Jeffreys' noninformative prior," IEEE Trans. Image Processing (10) #9, 
pp.1322-1331, 2001 

115. M. Figueiredo, R. Nowak, "An EM algorithm for wavelet-based image restoration," 
IEEE Trans. Image Processing (12) #8, pp.906-916, 2003 

116. S. Fine, Y. Singer, N. Tishby, "The hierarchical hidden Markov model: analysis and 
applications," Machine Learning (32), pp.41-62, 1998 

117. P. Flandrin, "Wavelet analysis and synthesis of fractional Brownian motion," IEEE 
Trans. Information Theory (38) #2, pp.910-917, 1992 

118. R. Frankot, R. Chellappa, "A method for enforcing integrability in shape from shading 
algorithms," IEEE Trans. PAMI (10) #4, pp.439-451, 1989 

119. W. Freeman, T. Jones, E. Pasztor, "Example-based super-resolution," IEEE Computer 
Graphics and Applications, pp.56-65, 2002 




120. L. Fu, E. Christensen, C. Yamarone, M. Lefebvre, Y. Menard, M. Dorrer, P. Es- 
cudier, "TOPEX/POSEIDON mission overview," J. Geophysical Research (99) #C12, 
pp.24369-24381, 1994 

121. I. Fukumori, P. Malanotte-Rizzoli, "An approximate Kalman filter for ocean data as- 
similation; an example with an idealized Gulf Stream model," J. Geophysical Research, 
1994 

122. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990 

123. P. Gaspar, C. Wunsch, "Estimates from altimeter data of barotropic Rossby waves in 
the northwestern Atlantic ocean," J. Physical Oceanography (19) #12, pp.1821-1844, 
1989 

124. D. Geiger, A. Gupta, L. Costa, J. Vlontzos, "Dynamic programming for detecting, track- 
ing, and matching deformable contours," IEEE Trans. PAMI (17) #3, pp.294-302, 1995 

125. A. Gelb, Applied Optimal Estimation, MIT Press, 2002 

126. S. Gelfand, S. Mitter, "On sampling methods and annealing algorithms" in R. Chellappa, 
A. Jain (Eds.), Markov Random Fields - Theory and Application, pp.499-515, 1993 

127. S. Geman, D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian 
restoration of images," IEEE Trans. PAMI (6) #6, pp.721-741, 1984 

128. D. Geman, J. Jedynak, "An active testing model for tracking roads in satellite images," 
IEEE Trans. PAMI (18) #1, pp.1-14, 1996 

129. D. Geman, "Random fields and inverse problems in imaging," Lecture Notes in Mathe- 
matics (1427), Springer, pp. 117-193, 1991 

130. A. George, "Nested dissection of a regular finite element mesh," SIAM J. Numerical 
Analysis, pp.345-363, 1973 

131. A. George, J. Liu, Computer Solution of Large Sparse Positive Definite Systems, 
Prentice-Hall, 1981 

132. A. George, M. Heath, J. Liu, E. Ng, "Sparse Cholesky factorization on a local-memory 
multiprocessor," Faculty of Mathematics Technical Report CS-86-02, University of Wa- 
terloo, 1986 

133. A. George, J. Gilbert, J. Liu (Eds), Graph Theory and Sparse Matrix Computation, 
Springer, 1993 

134. D. Ghiglia, M. Pritt, Two-Dimensional Phase Unwrapping, Wiley, 1998 

135. M. Ghil, P. Malanotte-Rizzoli, "Data assimilation in meteorology and oceanography," 
Advances in Geophysics (33), pp.141-266, 1991 

136. B. Gidas, "A renormalization group approach to image processing problems," IEEE 
Trans. PAMI (11) #2, pp.164-180, 1989 

137. W. Gilks, S. Richardson, D. Spiegelhalter (Eds), Markov Chain Monte Carlo in Practice, 
Chapman & Hall, 1996 

138. R. Goldstein, H. Zebker, C. Werner, "Satellite radar interferometry: two-dimensional 
phase unwrapping," Radio Science (23) #4, pp.713-720, 1988 

139. G. Golub, M. Heath, G. Wahba, "Generalized cross validation," Technometrics (21), 
p.215, 1979 

140. G. Golub, U. von Matt, "Generalized cross-validation for large scale problems," J. Com- 
putational and Graphical Statistics (6) #1, pp.1-34, 1997 

141. G. Golub, C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996 

142. G. Golub, D. O'Leary, "Some history of the conjugate gradient and Lanczos algorithms," 
SIAM Review (31), pp.50-102, 1989 

143. R. Gonzalez, R. Woods, Digital Image Processing (3rd ed.), Prentice-Hall, 2007 

144. J. Goodman, A. Sokal, "Multigrid Monte Carlo method, conceptual foundations," Phys- 
ical Review D (40) #6, pp.2035-2071, 1989 




145. N. Gordon, D. Salmond, A. Smith, "Novel approach to nonlinear / nonGaussian 
Bayesian state estimation," IEE Proceedings (F140), pp.107-113, 1993 

146. C. Graffigne, F. Heitz, P. Perez, F. Preteux, M. Sigelle, J. Zerubia, "Hierarchical Markov 
random field models applied to image analysis: a review," SPIE (2568), 1995 

147. P. Green, "Reversible jump Markov chain Monte Carlo computation and Bayesian model 
determination," Biometrika (82) #4, p.711, 1995 

148. L. Greengard, V. Rokhlin, "A fast algorithm for particle simulations," J. Computational 
Physics (73), pp.325-348, 1987 

149. H. Greenspan, C. Anderson, S. Akber, "Image enhancement by nonlinear extrapolation 
in frequency space," IEEE Trans. Image Processing (9) #6, pp. 1035-1047, 2000 

150. U. Grenander, Elements of pattern theory, Johns Hopkins University Press, 1996 

151. M. Grewal, A. Andrews, Kalman Filtering: Theory and Practice, Prentice-Hall, 1993 

152. W. Hackbusch, Multi-Grid Methods and Applications, Springer, 1985 

153. W. Hackbusch, Iterative Solution of Large Sparse Systems of Equations, Springer, 1994 

154. J. Hadamard, Lectures on the Cauchy Problem in Linear Partial Differential Equations, 
Yale University Press, 1923 

155. J. Handschin, "Monte Carlo techniques for prediction and filtering of non-linear stochas- 
tic processes," Automatica (6), pp.555-563, 1970 

156. M. Hayes, Statistical Digital Signal Processing and Modeling, Wiley, 1996 

157. X. He, R. Zemel, M. Carreira-Perpina, "Multiscale conditional random fields for image 
labeling," IEEE CVPR, 2004 

158. W. Heiskanen, H. Moritz, Physical Geodesy, W.H. Freeman & Co., 1967 

159. B. Horn, M. Brooks, "Integrability of Surface Gradients," MIT A.I. Memo 813, 1985 

160. B. Horn, Robot Vision, MIT Press, 1986 

161. B. Horn, "Height and gradient from shading," Int. J. Computer Vision (5) #1, pp.37-46, 
1990 

162. R. Horn, C. Johnson, Matrix Analysis, Cambridge University Press, 1990 

163. R. Horn, C. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1994 

164. P. Houtekamer, H. L. Mitchell, "Data assimilation using an ensemble Kalman filter tech- 
nique," Monthly Weather Review (126), pp.796-811, 1998 

165. A. Hyvarinen, E. Oja, "Independent component analysis: algorithms and applications," 
Neural Networks (13) #4-5, pp.411-430, 2000 

166. A. Hyvarinen, J. Karhunen, E. Oja, Independent Component Analysis, Wiley, 2001 

167. Ikeuchi, B. Horn, "Numerical shape from shading and occluding boundaries," Artificial 
Intelligence (17) #1-3, pp.141-184, 1981 

168. I. Ipsen, C. Meyer, "The idea behind Krylov methods," American Mathematical Monthly 
(105) #10, pp.889-899, 1998 

169. W. Irving, P. Fieguth, A. Willsky, "An overlapping tree approach to multiscale stochastic 
modeling and estimation," IEEE Trans. Image Processing (6) #11, pp.1517-1529, 1997 

170. M. Isard, A. Blake, "CONDENSATION - conditional density propagation for visual 
tracking," Int. J. Computer Vision (29), pp.5-28, 1998 

171. E. Ising, "Beitrag zur Theorie des Ferromagnetismus," Z. Phys. (31), pp.253-258, 1925 

172. V. Ivanov, "On linear problems which are not well-posed," Soviet Math. Dokl. (3), 
pp.981-983, 1962 

173. V. Ivanov, "The approximate solution of operator equations of the first kind," USSR 
Comp. Math. Phys. (6), pp.197-205, 1966 

174. A. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, 1989 

175. A. Jain, R. Duin, J. Mao, "Statistical pattern recognition: a review," IEEE Trans. PAMI 
(22) #1, pp.4-37, 2000 




176. M. Jamieson, P. Fieguth, L. Lee, "Parametric contour estimation by simulated anneal- 
ing," IEEE ICIP, Spain, 2003 

177. F. Jensen, Bayesian Networks and Decision Graphs, Springer, 2001 

178. F. Jin, P. Fieguth, L. Winger, "Wavelet video denoising with regularized multiresolution 
motion estimation," EURASIP J. Applied Signal Processing #72705, 2006 

179. I. Jolliffe, Principal Components Analysis (2nd ed.), Springer, 2002 

180. M. Jordan, Learning in graphical models, Kluwer Academic Publishers, 1998 

181. S. J. Julier, J. K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," 
Proc. AeroSense, 1997 

182. R. Kalman, "Contributions to the theory of optimal control," Bol. Soc. Mat. Mexicana 
(5), pp.102-119, 1960 

183. A. Kak, M. Slaney, Principles of Computerized Tomographic Imaging, IEEE, 1999 

184. D. Kandel, E. Domany, "General cluster Monte Carlo dynamics," Physical Review B 
(43) #10, pp.8539-8548, 1991 

185. M. Kass, A. Witkin, D. Terzopoulos, "Snakes: active contour models," Int. J. Computer 
Vision (1) #4, pp.321-331, 1988 

186. Z. Kato, M. Berthod, J. Zerubia, "A hierarchical Markov random field model and mul- 
titemperature annealing for parallel image classification," Graphical Models and Image 
Processing (58) #1, pp. 18-37, 1996 

187. W. Kaula, Theory of Satellite Geodesy, Blaisdell Publishing Co., 1966 

188. H. Kaufman, J. Woods, M. Tekalp, S. Dravida, "Estimation and identification of two- 
dimensional images," IEEE Trans. Automatic Control (AC-28), pp.745-756, 1983 

189. D. Keren, M. Werman, "Probabilistic analysis of regularization," IEEE Trans. PAMI (15) 
#10, pp.982-995, 1993 

190. M. Khalil, P. Wesseling, "Vertex-centered and cell-centered multigrid for interface prob- 
lems," Journal of Computational Physics (98) #1, pp. 1-10, 1992 

191. F. Khellah, P. Fieguth, J. Murray, M. Allen, "Statistical processing of large image se- 
quences," IEEE Trans. Image Processing (14) #1, pp.80-93, 2005 

192. J. Kim, J. Woods, "Spatio-temporal adaptive 3D Kalman filter for video," IEEE Trans. 
Image Processing (6) #3, p.414, 1997 

193. J. Kim, R. Zabih, "Factorial Markov random fields," ECCV, pp.321-334, 2002 

194. G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian non-linear state space 
models," J. Computational and Graphical Statistics (5) #1, pp.1-25, 1996 

195. B. Kosko (Ed.), Neural Networks for Signal Processing, Prentice-Hall, pp.37-61, 1992 

196. S. Kumar, M. Hebert, "Discriminative random fields: a discriminative framework for 
contextual interaction in classification," Proc. IEEE ICCV (2), pp.1150-1157, 2003 

197. J.M. Laferte, P. Perez, F. Heitz, "Discrete Markov image modeling and inference on the 
quadtree," IEEE Trans. Image Processing (9) #3, pp. 390-404, 2000 

198. J. Lafferty, A. McCallum, F. Pereira, "Conditional random fields: Probabilistic mod- 
els for segmenting and labeling sequence data," Proc. Int. Conf. on Machine Learning 
(ICML), pp.282-289, 2001 

199. S. Lakshmanan, H. Derin, "Gaussian Markov random fields at multiple resolutions," in 
Markov Random Fields - Theory and Application (R. Chellappa, A. Jain (Eds)), pp. 131- 
157, 1993 

200. P. Lancaster, L. Rodman, Algebraic Riccati Equations, Oxford University Press, 1995 

201. K. Lee, C. Kuo, "Shape from shading with a linear triangular element surface model," 
IEEE Trans. PAMI (15) #8, pp.815-822, 1993 

202. D. Lee, J. Shiau, "Thin plate splines with discontinuities and fast algorithms for their 
computation," SIAM J. Scientific Computing (15) #6, pp.1311-1330, 1994 




203. P. LeTraon, P. Gaspar, F. Bouyssel, H. Makhmara, "Using Topex/Poseidon data to en- 
hance ERS-1 data," J. Atmospheric and Oceanic Technology (12), pp.161-170, 1995 

204. S. Levitus, Climatological Atlas of the World Ocean, United States Government Printing, 
1982 

205. J. Li, R. Gray, Image Segmentation and Compression Using Hidden Markov Models, 
Kluwer Academic, 2000 

206. J. Li, A. Najmi, R. Gray, "Image classification by a two dimensional hidden Markov 
model," IEEE Trans. Signal Processing (48) #2, pp.517-533, 2000 

207. S. Li, Markov Random Field Modeling in Computer Vision, Springer, 2001 

208. Z. Liang, C. Fernandes, F. Magnani, P. Philippi, "A reconstruction technique for three- 
dimensional porous media using image analysis and Fourier transforms," J. Petroleum 
Science and Engineering (21) #3-4, pp.273-283, 1998 

209. L. Liang, C. Liu, Y. Xu, B. Guo, H. Shum, "Real-time texture synthesis by patch-based 
sampling," ACM Trans. Graphics (20) #3, pp.150, 2001 

210. J. Lim, Two-Dimensional Signal and Image Processing, Prentice-Hall, 1990 

211. J. Liu, "Computational models and task scheduling for parallel sparse Cholesky factor- 
ization," Parallel Computing (3), pp.327-342, 1986 

212. L. Ljung, System Identification : Theory for the User, Prentice-Hall, 1987 

213. M. Luettgen, W. Karl, A. Willsky, R. Tenney, "Multiscale representations of Markov 
random fields," IEEE Trans. Signal Processing (41) #12, pp.3377-3396, 1993 

214. M. Luettgen, W. Karl, A. Willsky, "Efficient multiscale regularization with applications 
to the computation of optical flow," IEEE Trans. Image Processing (3) #1, pp.41-64, 
1994 

215. M. Luettgen, A. Willsky, "Multiscale smoothing error models," IEEE Trans. Automatic 
Control (40) #1, 1995 

216. T. Lundahl, W. Ohley, S. Kay, R. Siffert, "Fractional Brownian motion: A maximum 
likelihood estimator and its application to image texture," IEEE Trans. Medical Imaging 
(5), pp.152-161, 1986 

217. J. Luo, C. Guo, "Perceptual grouping of segmented regions in color images," Pattern 
Recognition (36), pp.2781-2792, 2003 

218. M. Lustig, D. Donoho, J. Pauly, "Sparse MRI: The application of compressed sensing 
for rapid MR imaging," Magnetic Resonance in Medicine (58) #6, pp.1182-1195, 2007 

219. P. Malanotte-Rizzoli, "Data assimilation: fundamentals, global and Mediterranean examples,"
in P. Malanotte-Rizzoli, A. Robinson (Eds.), Ocean Processes in Climate Dynamics:
Global and Mediterranean Examples, NATO ASI Series (419), 1994

220. F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, P. Suetens, "Multimodality image 
registration by maximization of mutual information," IEEE Trans. Medical Imaging (16) 
#2, pp.187-198, 1997

221. S. Mallat, "A theory of multiresolution signal decomposition: The wavelet representation,"
IEEE Trans. PAMI (11) #7, pp.674-693, 1989

222. S. Mallat, S. Zhong, "Characterization of signals from multiscale edges," IEEE Trans. 
PAMI (14) #9, pp.710-732, 1992

223. B. Mandelbrot, J. van Ness, "Fractional Brownian motions, fractional noises and appli- 
cations," SIAM Review (10), pp.422-437, 1968 

224. B. Mandelbrot, The Fractal Geometry of Nature, W.H. Freeman & Co., 1982

225. B. Manjunath, T. Simchony, R. Chellappa, "Stochastic and Deterministic Networks 
for Texture Segmentation," IEEE Trans. Acoustics, Speech, Signal Processing (38) #6, 
pp. 1039-1049, 1990 

226. B. Manjunath, R. Chellappa, "Unsupervised texture segmentation using Markov random 
field models," IEEE Trans. PAMI (13) #5, pp.478-482, 1991



227. G. Matheron, Les Variables Régionalisées et Leur Estimation, Masson, 1965

228. S. McCormick, Multigrid methods, SIAM, 1987 

229. S. McCormick, Multilevel Adaptive Methods for Partial Differential Equations, SIAM, 
1989 

230. D. Melas, S. Wilson, "Double Markov random fields and Bayesian image segmentation," 
IEEE Trans. Signal Processing (50) #2, pp.357-365, 2002 

231. J. Mendel, Lessons in Estimation Theory for Signal Processing, Communications, and
Control, Prentice-Hall, 1995 

232. D. Menemenlis, P. Fieguth, C. Wunsch, A. Willsky, "Adaptation of a fast optimal interpolation
algorithm to the mapping of oceanographic data," J. Geophysical Research
(102) #C5, pp.10573-10584, 1997

233. C. Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2001 

234. M. Mignotte, C. Collet, P. Perez, P. Bouthemy, "Sonar image segmentation using an 
unsupervised hierarchical MRF model," IEEE Trans. Image Processing (9) #7, pp.1216-1231,
2000

235. M. Mignotte, "Nonparametric multiscale energy-based model and its application in 
some imagery problems," IEEE Trans. PAMI (26) #2, pp.184-197, 2004 

236. A. Mohebi, P. Fieguth, M. Ioannidis, "Statistical fusion of two-scale images of porous 
media," Advances in Water Resources (32), pp. 1567-1579, 2009 

237. E. Simoncelli, P. Muller, B. Vidakovic (Eds), Bayesian Inference in Wavelet-Based 
Methods, Springer, 1999 

238. W. Munk, P. Worcester, C. Wunsch, Ocean Acoustic Tomography, Cambridge University 
Press, 1995 

239. K. Nagpal, R. Helmick, C. Sims, "Reduced-order estimation Part 1: Filtering," Int. J. 
Control (45) #6, pp.1867-1888, 1987 

240. R. Nash, S. Jordan, "Statistical geodesy - an engineering perspective," Proc. IEEE (66) 
#5, pp.532-550, 1978 

241. T. Ojala, M. Pietikäinen, D. Harwood, "A comparative study of texture measures with
classification based on feature distributions," Pattern Recognition (29), pp.51-59, 1996

242. T. Ojala, M. Pietikainen, T. Maenpaa, "Multiresolution gray-scale and rotation invariant 
texture classification with local binary patterns," IEEE Trans. PAMI (24) #7, pp.971-987,
2002

243. A. Oppenheim, R. Schafer, Discrete-time signal processing (3rd ed.), Prentice-Hall, 
2009 

244. A. Oppenheim, A. Willsky, H. Nawab, Signals & Systems, Prentice-Hall, 1997 

245. M. Ortner, X. Descombes, J. Zerubia, "Building outline extraction from digital elevation 
models using marked point processes," Int. J. Computer Vision (72) #2, pp. 107-132, 
2007 

246. M. Ortner, X. Descombes, J. Zerubia, "A marked point process of rectangles and seg- 
ments for automatic analysis of digital elevation models," IEEE Trans. PAMI (30), 
pp.105-119, 2009

247. V. Pan, How to Multiply Matrices Faster, Springer, 1984

248. A. Papoulis, S. Pillai, Probability, Random Variables, and Stochastic Processes, Mc- 
Graw Hill, 2002 

249. J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann Publishers, 
1988 

250. K. Pearson, "On lines and planes of closest fit to systems of points in space," Philosoph- 
ical Magazine (2), pp.559-572, 1901 

251. S. Peleg, G. Ron, "Nonlinear multiresolution: a shape from shading example," IEEE 
Trans. PAMI (12) #12, pp.1206-1210, 1990 



252. D. Percival, A. Walden, Spectral Analysis for Physical Applications, Cambridge Univer- 
sity Press, 1993 

253. A. Pentland, B. Moghaddam, T. Starner, "View-based and modular eigenspaces for face 
recognition," CVPR, pp. 84-91, 1994 

254. P. Perez, F. Heitz, "Restriction of a Markov random field on a graph and multiresolution 
statistical image modeling," IEEE Trans. Information Theory (42) #1, pp.180-190, 1996

255. P. Perez, "Markov random fields and images," CWI Quarterly (11) #4, pp.413-437, 1998

256. P. Perez, J. Vermaak, A. Blake, "Data fusion for visual tracking with particles," Proc. 
IEEE (92) #3, pp.495-513, 2004 

257. K. Petersen, M. Pedersen, The Matrix Cookbook, Technical University of Denmark, 
2008 

258. H. Permuter, J. Francos, I. Jermyn, "A study of Gaussian mixture models of color and 
texture features for image classification and segmentation," Pattern Recognition (39),
pp.695-706, 2006 

259. M. Petrou, P. Sevilla, Dealing with Texture, Wiley, 2006 

260. W. Pieczynski, A. Tebbache, "Pairwise Markov random fields and segmentation of tex- 
tured images," Machine Graphics & Vision (9) #3, pp.705-718, 2000 

261. J. Pitman, Probability, Springer, 1993 

262. J. Portilla, E. Simoncelli, "A parametric texture model based on joint statistics of com- 
plex wavelet coefficients," Int. J. Computer Vision (40) #1, pp.49-70, 2000 

263. J. Portilla, V. Strela, M. Wainwright, E. Simoncelli, "Image denoising using scale mix- 
tures of Gaussians in the wavelet domain," IEEE Trans. Image Processing (12) #11, 
pp.1338-1351, 2003

264. W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes: The Art of Sci- 
entific Computing, Cambridge University Press, 2007 

265. L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech 
recognition," Proc. IEEE (77), pp.257-285, 1989 

266. H. Rauch, F. Tung, C. Striebel, "Maximum likelihood estimates of linear dynamic systems,"
AIAA Journal (3) #8, 1965

267. J. Rehg, Visual Analysis of high DOF Articulated Objects with Application to Hand 
Tracking, PhD thesis, Carnegie Mellon University, 1995 

268. J. Reid, "On the method of conjugate gradients for solving linear systems," In J. Reid 
(Ed.), Large Sparse Sets of Linear Equations, Academic Press, 1971 

269. D. Reid, "An algorithm for tracking multiple targets," IEEE Trans. Automatic Control 
(24) #6, pp.843-854, 1979 

270. B. Ripley, Statistical Inference for Spatial Processes, Wiley, 1988 

271. B. Ripley, Spatial Statistics, Wiley, 1991 

272. B. Ristic, S. Arulampalam, N. Gordon, Beyond the Kalman Filter: Particle Filters for 
Tracking Applications, Artech House, 2004 

273. V. Rokhlin, "Rapid solution of integral equations of classical potential theory," J. Computational
Physics (60), pp. 187-207, 1985

274. J. Romberg, H. Choi, R. Baraniuk, "Bayesian tree-structured image modeling using
wavelet-domain hidden Markov models," IEEE Trans. Image Processing (10) #7,
pp. 1056-1068, 2001

275. O. Ronen, J. Rohlicek, M. Ostendorf, "Parameter estimation of dependence tree models 
using the EM algorithm," IEEE Signal Processing Letters (2) #8, 1995 

276. H. Rue, L. Held, Gaussian Markov Random Fields: Theory and Applications, CRC 
Press, 2005 

277. Y. Saad, M. Schultz, "GMRES: A generalized minimal residual algorithm for solving 
nonsymmetric linear systems," SIAM J. Sci. Stat. Comput. #7, pp. 856-869, 1986



278. Y. Saad, Iterative methods for sparse linear systems (2nd ed.), SIAM, 2003 

279. D. Saupe, "Algorithms for Random Fractals," in H. Peitgen, D. Saupe (Eds), The Science 
of Fractal Images, Springer, 1988

280. R. Schalkoff, Pattern recognition : statistical, structural, and neural approaches, Wiley, 
1992 

281. M. Schroeder, Fractals, Chaos, Power Laws, Freeman, 1991 

282. M. Schwartz, J. Barrett, P. Fieguth, P. Rosenkranz, M. Spina, D. Staelin, "Observations 
of thermal and precipitation structure in a tropical cyclone by means of passive microwave
imagery near 118 GHz," Journal of Applied Meteorology (35) #5, pp.671-678,
1996

283. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, 1976 

284. K. Shanmugan, A. Breipohl, Random Signals, Wiley, 1988 

285. J. Shewchuk, "An introduction to the conjugate gradient method without the agonizing 
pain," (unpublished), 1994 

286. E. Simoncelli, W. Freeman, E. Adelson, D. Heeger, "Shiftable multi-scale transforms,"
IEEE Trans. Information Theory (38) #2, pp.587-607, 1992

287. D. Simon, Optimal state estimation: Kalman, H∞, and nonlinear approaches,
Wiley-Interscience, 2006

288. S. Sinha, B. Schunck, "A two stage algorithm for discontinuity detection," IEEE Trans. 
PAMI (14) #1, pp.36-55, 1992 

289. I. Sobey, Numerical Computation Notes (unpublished), 2000 

290. M. Srinath, P. Rajasekaran, R. Viswanathan, An introduction to statistical signal pro- 
cessing with applications, Prentice-Hall, 1996 

291. J. Starck, F. Murtagh, A. Bijaoui, Image Processing and Data Analysis: The Multiscale
Approach, Cambridge University Press, 1998 

292. R. Stoica, X. Descombes, J. Zerubia, "A Gibbs point process for road extraction from 
remotely sensed images," Int. J. Computer Vision (57) #2, pp. 121-136, 2004 

293. G. Strang, T. Nguyen, Wavelets and Filter Banks (2nd Ed.), Wellesley College, 1996 

294. V. Strassen, "Gaussian elimination is not optimal," Numer. Math (13), pp.354-356, 1969 

295. C. Sutton, A. McCallum, "An introduction to conditional random fields for relational 
learning," Introduction to Statistical Relational Learning, MIT Press, 2007 

296. M. Swain, D. Ballard, "Color indexing," Int. J. Computer Vision (7), pp. 11-32, 1991 

297. R. Swendsen, J. Wang, "Nonuniversal critical dynamics in Monte Carlo simulation," 
Physical Review Letters (58) #2, pp.86-88, 1987 

298. R. Szeliski, Bayesian Modeling of Uncertainty in Low-level Vision, Kluwer Academic, 
1989 

299. R. Szeliski, "Fast surface interpolation using hierarchical basis functions," IEEE Trans. 
PAMI (12) #6, pp.513-528, 1990 

300. R. Szeliski, D. Tonnesen, "Surface modeling with oriented particle systems," Computer 
Graphics (26) #2, pp.185-194, 1992 

301. A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation, 
SIAM, 2004 

302. D. Taubman, M. Marcellin, M. Rabbani, "JPEG2000: Image compression fundamentals, 
standards and practice," J. Electronic Imaging (11), pp.286, 2002

303. D. Terzopoulos, "Multilevel computer processes for visual surface reconstruction," 
Computer Vision, Graphics, Image Processing (24), pp. 52-96, 1983 

304. D. Terzopoulos, "Regularization of inverse visual problems involving discontinuities," 
IEEE Trans. PAMI (8) #4, pp.413-424, 1986

305. D. Terzopoulos, R. Szeliski, "Tracking with Kalman snakes," In Active Vision, MIT 
Press, pp.3-20, 1993 



306. A. Tewfik, M. Kim, "Correlation structure of the discrete wavelet coefficients of frac- 
tional Brownian motion," IEEE Trans. Information Theory (38) #2, pp. 904-909, 1992 

307. S. Thrun, Y. Liu, D. Koller, A. Ng, Z. Ghahramani, H. Durrant-Whyte, "Simultaneous 
localization and mapping with sparse extended information filters," Int. J. Robotics Re- 
search (23) #7-8, pp.693, 2004 

308. A. Tikhonov, V. Arsenin, Solutions of Ill-Posed Problems, Winston, 1977 

309. A. Tikhonov, A. Goncharski, V. Stepanov, I. Kochikov, "Ill-posed image processing 
problems," Soviet Physics - Doklady (32), pp.456-458, 1987 

310. A. Tikhonov et al., Numerical Methods for the Solution of Ill-Posed Problems,
Kluwer Academic Publishers, 1995 

311. M. Tippett, J. Anderson, C. Bishop, T. Hamill, J. Whitaker, "Ensemble square root filters,"
Monthly Weather Review (131), pp.1485-1490, 2003

312. S. Torquato, Random Heterogeneous Materials: Microstructure and Macroscopic Properties,
Springer, 2002

313. N. Trefethen, D. Bau, Numerical Linear Algebra, SIAM, 1997 

314. A. Tremeau, N. Borel, "A region growing and merging algorithm to color segmentation," 
Pattern Recognition (30) #7, pp. 1191-1203, 1997 

315. M. Varma, A. Zisserman, "A statistical approach to material classification using image 
patches," IEEE Trans. PAMI (31) #11, pp.2032-2047, 2009

316. O. Vasilyev, N. Kevlahan, "An adaptive multilevel wavelet collocation method for elliptic
problems," J. Computational Physics (206) #2, pp.412-431, 2005

317. B. Vidakovic, Statistical Modeling by Wavelets, Wiley, 1999 

318. C. Vieren, F. Cabestaing, J. Postaire, "Catching moving objects with snakes for motion 
tracking," Pattern Recognition Letters (16) #7, pp.679-685, 1995 

319. P. Viola, W. Wells, "Alignment by maximization of mutual information," Int. J. Computer
Vision (24) #2, pp. 137-154, 1997

320. A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum de- 
coding algorithm," IEEE Trans. Information Theory (13) #2, pp.260-269, 1967 

321. H. van der Vorst, Iterative Krylov Methods for Large Linear Systems, Cambridge Uni- 
versity Press, 2003 

322. G. Wahba, "Practical approximate solutions to linear operator equations when the data 
are noisy," SIAM J. Numerical Analysis (14), 1977 

323. G. Wahba, "Bayesian "Confidence Intervals" for the Cross-Validated Smoothing
Spline," J. Royal Statistical Society B (45), pp. 133-150, 1983

324. G. Wahba, Spline Models for Observational Data, SIAM Series in Applied Mathematics 
#59, SIAM, 1990 

325. E. Wan, R. van der Merwe, A. Nelson, "Dual estimation and the unscented transformation,"
in Advances in Neural Information Processing Systems 12 (S. Solla, T. Leen, K.
Müller, Eds.), MIT Press, pp.666-672, 2000

326. E. Wan, R. van der Merwe, "The unscented Kalman filter," in Kalman filtering and 
Neural Networks, Wiley, pp.221-280, 2001 

327. Y. Wang, Q. Ji, "A dynamic conditional random field model for object segmentation in 
image sequences," CVPR, 2005 

328. E. Wegman, D. DePrest, Statistical Image Processing and Graphics, Marcel Dekker,
1986 

329. S. Wesolkowski, P. Fieguth, "Hierarchical regions for image segmentation," ICIAR, Por- 
tugal, 2004 

330. P. Wesseling, An Introduction to Multigrid Methods, Wiley, 1991 

331. G. Whitten, "Scale space tracking and deformable sheet models for computational vision,"
IEEE Trans. PAMI (15) #7, 1993



332. K. Wilson, "Problems in physics with many scales of length," Scientific American (241), 
pp. 158-179, 1979 

333. A. Willsky, 6.433 - Recursive Estimation, Department of Electrical Engineering & 
Computer Science, Massachusetts Institute of Technology, 1992 

334. A. Willsky, "Multiresolution Markov models for signal and image processing," Proc. 
IEEE (90) #8, pp. 1396-1458, 2002 

335. G. Winkler, Image Analysis, Random Fields, and Dynamic Monte Carlo Methods (2nd 
ed.), Springer, 2003 

336. C. Won, R. Gray, Stochastic Image Processing, Springer, 2004 

337. J. Woods, C. Radewan, "Kalman filtering in two dimensions," IEEE Trans. Information 
Theory (23), pp.437-481, 1977 

338. J. Woods, V. Ingle, "Kalman filtering in two dimensions: Further results," IEEE Trans. 
Acoustics, Speech, Signal Processing (29), pp. 188-197, 1981 

339. G. Wornell, A. Oppenheim, "Estimation of fractal signals from noisy measurements 
using wavelets," IEEE Trans. Signal Processing (40), pp.611-623, 1992

340. G. Wornell, "Wavelet-based representation for the 1/f family of fractal processes," 
Proc. IEEE, 1993 

341. C. Wu, P. Doerschuk, "Tree approximations to Markov random fields," IEEE
Trans. PAMI (17) #4, pp.391-402, 1995

342. C. Wunsch, E. Gaposchkin, "On using satellite altimetry to determine the general circu- 
lation of the oceans with application to geoid improvement," Reviews of Geophysics and 
Space Physics (18) #4, pp.725-745, 1980 

343. C. Wunsch, "Sampling characteristics of satellite orbits," J. Atmospheric and Oceanic
Tech. (6) #6, pp.891-907, 1989

344. R. Yager, M. Fedrizzi, J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of 
Evidence, Wiley, 1994 

345. M. Yaou, W. Chang, "Fast surface interpolation using multiresolution wavelet transform,"
IEEE Trans. PAMI (16) #7, pp.673-688, 1994

346. D. Young, Iterative Solution of Large Linear Systems, Academic Press, 1971 

347. H. Yserentant, "On the multi-level splitting of finite element spaces," Numerische Math- 
ematik (49) #4, pp.379-412, 1986 

348. H. Yserentant, "Two preconditioners based on the multilevel splitting of finite element 
spaces," Numerische Mathematik (58) #2, pp. 163-184, 1990 

349. Q. Zheng, R. Chellappa, "Estimation of illuminant direction, albedo, and shape from 
shading," IEEE Trans. PAMI (13) #7, pp.680-702, 1991



Index 



ABCD lemma 68, 99, 388 

Banded matrices see Sparsity 
Basis 

Change see Change of basis 

Definition 384 

Reduction see Dimensionality reduction 
Bayesian problems 

Approximate 70 

Definition 31

Dynamic 42, 86 

Estimation 64 

Least squares 45, 65 

Linear least squares 45, 66 

MAP 45 

Regularization 37 

Static 40, 67 
Boundary conditions 153 

Canonical problems 

Data fusion 41 

Dynamic 42 

Static 40 
Change of basis 241 

Explicit 244 

Hierarchical 269, 276 

Implicit 245 

Kernels 254 

Orthogonal 245 
Cholesky decomposition 295, 401, 407
Circulance 262, 392 

Circular convolution 427 
Condition number 27, 401 
Conditional random fields 225 



Conditioning 23, 25, 27, 206, 241, 310 
Conjugate gradient 306 
Convolution 145, 424, 427 

Circular 427 
Coupling 135, 242, 281
Covariance matrices 37, 70, 389, 419, 420 

Approximated 167 

Definition 414 

Diagonal 418 

Eigendecomposition 399 

Inverse and Markovianity 188 

Positive-definite 389, 390 
Cross validation 37, 38, 83 

Data fusion 16, 41, 76, 83, 377
Deterministic problems see Non-Bayesian problems
Dimensionality reduction 135, 247, 250, 253
Duality 

Bayesian-non-Bayesian 39, 73 

Prior-measurement 73 

Static-dynamic 89, 326 
Dynamic problems 

Canonical 42, 86 

Estimation 85, 97, 100 

Sampling 118, 357

Eigendecompositions 248, 262, 304, 305, 396, 420
Estimation 46, 69 

Approximate 70 

Bayesian 40, 45, 64 

Definition 44 






Discrete state 119 

Dynamic 42, 85, 97, 100

Non-Bayesian 48, 58 

Static 40, 57 

Using Fourier transform 266 
Expectation 64, 186, 411
Expectation-Maximization (EM) 232 

Forward problems 14, 78 
Fourier transform 262, 361, 428

Gauss-Markov processes 88, 181 
Gauss-Markov random fields 185, 189 
Gaussian 419 

Multivariate distribution 417 

Univariate distribution 412 
Gaussian elimination 295, 402 
Gaussian transformation 402 
Gibbs random fields 192 

Comparison to Markov 194 

Ising 228, 283 

Sampling 366 
Gram-Schmidt 309, 404 
Graphical coupling 135, 242, 281

Hidden Markov models 215 

Modelling 231 
Hierarchical methods 

Change of basis 269, 276 

Discrete state fields 281, 373

Interpolated bases 272 

Markov random fields 278 

MCMC 370 

Multigrid 313 

Multiscale 339, 363 

Nested dissection 296 

Overview 137 

Wavelets 273 

Ill-Conditioned problems see Conditioning 
Ill-Posed problems see Posedness, Inverse problems
Image processing 22, 423 

Blurring 21, 24, 425

Convolution 424, 427 

Denoising 29, 216, 347, 431 

Edge detection 221, 425, 431

Fourier transform 428 

Segmentation 219, 233, 432 



Wavelet transform 273, 428
Interpolation 9 

Cross validation 38 

Dynamic 98, 112

Iterative 302 

Multidimensional 159 

Posedness 20 

Regularization 32 
Inverse problems 2, 13

Bayesian see Bayesian problems 

Canonical see Canonical problems 

Conditioning 23, 25 

Dynamic 42 

Non-Bayesian see Non-Bayesian 
problems 

Posedness 19, 30 

Regularization 29 

Static 40 
Iterative solvers see Linear systems 
methods 

Kalman filter 97, 98, 100, 325, 330 

Derivation 93 

Information form 102 

Marching 327, 332 

Nonlinear 114 

Reduced order 338 

Reduced update 336 

Smoother 109, 112, 331

Square root 103 

Steady state 107, 334 

Stripped 334 
Kernels see Sparsity 
Kriging 45, 164 

Least squares 30, 48, 62, 409 

Bayesian 45, 65 

Linear 58 

Pseudoinverse 61 

Regularized 61 

Weighted 48, 61 
Lexicographic ordering 133 
Linear dependence 19, 384 
Linear regression 23, 62 
Linear systems 245, 293 

Overdetermined 22, 409 

Underdetermined 22, 409 
Linear systems methods 293 

Cholesky decomposition 295, 401
Conjugate gradient 306
Gauss-Jacobi 299 
Gauss-Seidel 299, 302 
Gaussian elimination 295, 402 
LU decomposition 403 
Multigrid 313 
Nested dissection 296 
Preconditioning 310 
Successive overrelaxation (SOR) 303 
LU decomposition 403 

Marching methods 327, 362 
Markov chains 119, 181
Markovianity 179, 222, 297, 327 

Gauss-Markov process 88, 181 

Gauss-Markov random field 185 

Hidden 215, 223 

Markov random field 182 

One-dimensional 180

Sparsity 188 
Matlab XV, 10 
Matrix 

Block-matrix identities 387 

Circulance see Circulance 

Condition number 27, 401

Conditioning 23, 27, 310

Covariance see Covariance 

Derivatives 394, 395 

Identities 387 

Inequality 389 

Jacobian 394 

Kernels see Sparsity 

List of types 392, 393

Null space 19, 384 

Positivity see Positive-definiteness 

Pseudoinverse 409 

Range space 19, 385 

Rank 19, 385 

Sparse see Sparsity 

Square root 166, 401, 407
Matrix transformations 395 

Cholesky decomposition 295, 401 

Eigendecomposition 396 

Gaussian 402 

Gaussian elimination 295, 402 

Gram-Schmidt 404 

LU decomposition 403 

QR 104, 406

Square root 407 



Maximum a Posteriori (MAP) 45 
Maximum likelihood (ML) 48, 63 
MCMC see Monte Carlo 
Membrane prior 35, 150 
Model inference 169, 170, 199, 206 

Parameter estimation 50 
Modelling 148 

Bayesian 158 

Boundary conditions 153 

Dynamic 166, 325

Hidden fields 231 

Markov random field 199, 204 

Multidimensional 134 

Non-Bayesian 149 

Nonstationary 164 

Representation 172, 207
Monte Carlo 355, 365 
Multigrid 313 
Multiscale 339, 363 

Non-Bayesian problems 

Definition 31

Estimation 58 

Maximum likelihood 63 

Modelling 149 

Regularization 34, 61 
Normal distribution see Gaussian 
Null space 19, 384 

Orthogonality 59, 65 

Parameter estimation see Model inference 
Posedness 23, 30 
Positive-definiteness 388 

Analytical forms 160 

Definition 392 

Intuition 27, 390 
Posterior sampling see Sampling 
Preconditioning 310 
Principal components 248, 252 
Prior 

Boundaries 153 

Membrane 35, 150 

Model, Bayesian 37, 158 

Model, non-Bayesian 34, 150 

Sampling see Sampling 

Thin-plate 35, 150 

Zero mean 40, 68, 71 
Prior-Free problems see Non-Bayesian
problems
Pseudoinverses 61, 247, 409
QR decomposition 104, 406 

Random fields 6 

Causal 189 

Conditional 225 

Definition 415 

Discrete state 227 

Gibbs 192 

Markov 182 

Markov vs. Gibbs 194 

Modelling 199 

Noncausal 185 

Stationary 415 
Random variables 411 

Correlation 413 

Independence 412 
Random vectors 414 

Covariance 414 

Transformation 416 
Range space 19, 385 
Rank 

Full column 19, 385, 409

Full row 19, 385
Regularization 29 

Bayesian 37 

Non-Bayesian 34 

Tikhonov 30 

Sampling 355, 362, 374, 408
Discrete state 366, 370
Dynamic 118, 357
Examples 46, 75, 92
Prior 42, 74, 374

Static 74, 358 

Using Fourier transform 266 
Simulated annealing 366, 369, 370, 372 
Singular value decomposition 400 
Singular value decompositions 27 
Sparsity 

Banded matrices 139, 144

Computational complexity 144 

Markovianity 188 

Matrix kernels 141, 146, 151

Sparse matrices 139 
Square root matrices 103, 401, 407
Static problems 

Bayesian 67 

Canonical 40 

Estimation 57 

Non-Bayesian 58 

Sampling 74, 358 
Stationarity 141, 160, 164, 415 
SVD see Singular value decomposition 

Thin-plate prior 35, 150 
Tikhonov regularization 30 
Tomography 50 

Validation 35 
Variogram 45 
Viterbi 120 

Wavelet transform 273, 347, 428 
Wavelets and statistics 275 



